[Rd] combining large list of data.frames

From: Cole Beck <cole.beck_at_vanderbilt.edu>
Date: Thu, 19 Apr 2012 17:34:29 -0500

I usually create a list of data.frames and then call do.call('rbind', list(...)) to combine them into a single data.frame. However, I've noticed that as the list grows large, it can be much faster to do this in chunks. As an example, here's a list of 20,000 similar data.frames.

# create list of data.frames
dat <- vector("list", 20000)
for(i in seq_along(dat)) {
   # each data.frame gets between 1 and 30 rows
   size <- sample(1:30, 1)
   dat[[i]] <- data.frame(id=rep(i, size),
                          value=rnorm(size),
                          letter=sample(LETTERS, size, replace=TRUE),
                          ind=sample(c(TRUE,FALSE), size, replace=TRUE))
}

# combine into one data.frame, normal usage
# system.time(do.call('rbind', dat)) # takes 2-3 minutes

combine <- function(x, steps=NA, verbose=FALSE) {
   nr <- length(x)
   if(is.na(steps)) steps <- nr
   # increase steps until it divides the list length evenly
   while(nr %% steps != 0) steps <- steps+1
   if(verbose) cat(sprintf("step size: %s\r\n", steps))
   dl <- vector("list", steps)
   for(i in seq(steps)) {
     # indices of the i-th chunk of nr/steps list elements
     ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
     dl[[i]] <- do.call("rbind", x[ix])
   }
   # combine the per-chunk results
   do.call("rbind", dl)
}

# combine into one data.frame

system.time(combine(dat, 100)) # takes 5-10 seconds
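
As a sanity check on the chunked approach above, here is a minimal sketch that compares a chunked rbind against a single do.call("rbind", ...) on a much smaller list. The names chunk_rbind, chunk_size and small are made up for this illustration and are not part of the code above; the list is kept small so the direct call finishes quickly.

```r
# Hypothetical helper (not from the original post): rbind a list of
# data.frames in chunks of chunk_size elements, then rbind the chunks.
chunk_rbind <- function(x, chunk_size) {
  # group index: elements 1..chunk_size -> 1, the next chunk_size -> 2, ...
  grp <- ceiling(seq_along(x) / chunk_size)
  parts <- lapply(split(x, grp), function(chunk) do.call("rbind", chunk))
  # drop the group names so rbind does not prefix them onto row names
  do.call("rbind", unname(parts))
}

# small test list, built the same way as 'dat' but with only 200 elements
set.seed(1)
small <- lapply(1:200, function(i) {
  size <- sample(1:30, 1)
  data.frame(id = rep(i, size), value = rnorm(size))
})

direct  <- do.call("rbind", small)
chunked <- chunk_rbind(small, 20)

# both approaches should yield the same rows in the same order
stopifnot(nrow(direct) == nrow(chunked),
          identical(direct$id, chunked$id),
          identical(direct$value, chunked$value))
```

Unlike combine(), this sketch does not require the number of chunks to divide the list length evenly; the last chunk is simply shorter.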

I'm very surprised by this result. Does this improvement seem reasonable? I would think "do.call" could do something similar by default when the length of "args" is very large. Is using "do.call" not recommended in this scenario?

Cole Beck

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Fri 20 Apr 2012 - 06:59:11 GMT
