Re: [Rd] combining large list of data.frames

From: Patrick Aboyoun <patrick.aboyoun_at_oracle.com>
Date: Fri, 20 Apr 2012 17:55:15 -0700

Cole,
Bioconductor's high throughput sequencing infrastructure package IRanges contains code that may be useful for speeding up base::rbind.data.frame. I've extracted the salient bits from that rbind method, but left the corner case handling code out. IRanges's rbind method took the approach of treating a data set as a list of equal length columns, and so it contains a number of lapplys and vector concatenation calls. Given that base::rbind.data.frame sits at the core of many operations, I'm not sure if patches would be accepted for it, but I could take a crack at it.

biocRBind <- function(..., deparse.level=1) {

   ## Simplified version of IRanges's rbind method for DataFrame    ## Removed all data checks, ignored row names    args <- list(...)

   df <- args[[1L]]
   cn <- colnames(df)
   cl <- unlist(lapply(as.list(df, use.names = FALSE), class))
   factors <- unlist(lapply(as.list(df, use.names = FALSE), is.factor))    cols <- lapply(seq_len(length(df)), function(i) {
     cols <- lapply(args, `[[`, cn[i])
     if (factors[i]) { # combine factor levels, coerce to character
       levs <- unique(unlist(lapply(cols, levels), use.names=FALSE))
       cols <- lapply(cols, as.character)
     }
     combined <- do.call(c, unname(cols))
     if (factors[i])
       combined <- factor(combined, levs)
     as(combined, cl[i])

   })
   names(cols) <- colnames(df)
   do.call(data.frame, cols)
}

# create list of data.frames
set.seed(123)
dat <- vector("list", 20000)
for(i in seq_along(dat)) {

   size <- sample(1:30, 1)
   dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), size, replace=TRUE))
}

# sample runs
> system.time(do.call(biocRBind, dat))

    user system elapsed
   2.120 0.000 2.125
> system.time(do.call(biocRBind, dat))

    user system elapsed
   2.092 0.000 2.091
> system.time(do.call(biocRBind, dat))

    user system elapsed
   2.080 0.000 2.077
> sessionInfo()

R Under development (unstable) (2012-04-19 r59111) Platform: x86_64-unknown-linux-gnu (64-bit)

locale:

  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=C                 LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] tools_2.16.0

Cheers,
Patrick

On 4/19/2012 3:34 PM, Cole Beck wrote:

> It's normal for me to create a list of data.frames and then use 
> do.call('rbind', list(...)) to create a single data.frame.  However, 
> I've noticed as the size of the list grows large, it is perhaps better 
> to do this in chunks.  As an example here's a list of 20,000 similar 
> data.frames.
>
> # create list of data.frames
> dat <- vector("list", 20000)
> for(i in seq_along(dat)) {
>   size <- sample(1:30, 1)
>   dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size), 
> letter=sample(LETTERS, size, replace=TRUE), ind=sample(c(TRUE,FALSE), 
> size, replace=TRUE))
> }
> # combine into one data.frame, normal usage
> # system.time(do.call('rbind', dat)) # takes 2-3 minutes
> combine <- function(x, steps=NA, verbose=FALSE) {
>   nr <- length(x)
>   if(is.na(steps)) steps <- nr
>   while(nr %% steps != 0) steps <- steps+1
>   if(verbose) cat(sprintf("step size: %s\r\n", steps))
>   dl <- vector("list", steps)
>   for(i in seq(steps)) {
>     ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
>     dl[[i]] <- do.call("rbind", x[ix])
>   }
>   do.call("rbind", dl)
> }
> # combine into one data.frame
> system.time(combine(dat, 100)) # takes 5-10 seconds
>
> I'm very surprised by this result.  Does this improvement seem 
> reasonable?  I would think "do.call" could utilize something similar 
> by default when the length of "args" is too high.  Is using "do.call" 
> not recommended in this scenario?
>
> Regards,
> Cole Beck
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel 

	[[alternative HTML version deleted]]

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat 21 Apr 2012 - 01:00:19 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 24 Apr 2012 - 13:00:48 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive