Re: [Rd] combining large list of data.frames

From: Cole Beck <cole.beck_at_vanderbilt.edu>
Date: Mon, 23 Apr 2012 12:07:43 -0500

Thanks Patrick, this is a nice solution. Regarding a patch, I'm inclined to believe you're correct that it might not be accepted, though it is certainly something to consider.

Cheers,
Cole

On 04/20/2012 07:55 PM, Patrick Aboyoun wrote:
> Cole,
> Bioconductor's high throughput sequencing infrastructure package IRanges
> contains code that may be useful for speeding up base::rbind.data.frame.
> I've extracted the salient bits from that rbind method, but left the
> corner case handling code out. IRanges's rbind method took the approach
> of treating a data set as a list of equal length columns, and so it
> contains a number of lapplys and vector concatenation calls. Given that
> base::rbind.data.frame sits at the core of many operations, I'm not sure
> if patches would be accepted for it, but I could take a crack at it.
>
> biocRBind <- function(..., deparse.level=1)
> {
>     ## Simplified version of IRanges's rbind method for DataFrame
>     ## Removed all data checks, ignored row names
>     args <- list(...)
>     df <- args[[1L]]
>     cn <- colnames(df)
>     cl <- unlist(lapply(as.list(df, use.names = FALSE), class))
>     factors <- unlist(lapply(as.list(df, use.names = FALSE), is.factor))
>     cols <- lapply(seq_len(length(df)), function(i) {
>         cols <- lapply(args, `[[`, cn[i])
>         if (factors[i]) { # combine factor levels, coerce to character
>             levs <- unique(unlist(lapply(cols, levels), use.names=FALSE))
>             cols <- lapply(cols, as.character)
>         }
>         combined <- do.call(c, unname(cols))
>         if (factors[i])
>             combined <- factor(combined, levs)
>         as(combined, cl[i])
>     })
>     names(cols) <- colnames(df)
>     do.call(data.frame, cols)
> }
>
> # create list of data.frames
> set.seed(123)
> dat <- vector("list", 20000)
> for(i in seq_along(dat)) {
>     size <- sample(1:30, 1)
>     dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size),
>                            letter=sample(LETTERS, size, replace=TRUE),
>                            ind=sample(c(TRUE,FALSE), size, replace=TRUE))
> }
>
> # sample runs
> > system.time(do.call(biocRBind, dat))
>    user  system elapsed
>   2.120   0.000   2.125
> > system.time(do.call(biocRBind, dat))
>    user  system elapsed
>   2.092   0.000   2.091
> > system.time(do.call(biocRBind, dat))
>    user  system elapsed
>   2.080   0.000   2.077
> > sessionInfo()
> R Under development (unstable) (2012-04-19 r59111)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> loaded via a namespace (and not attached):
> [1] tools_2.16.0
>
>
> Cheers,
> Patrick
>
>
> On 4/19/2012 3:34 PM, Cole Beck wrote:
>> It's normal for me to create a list of data.frames and then use
>> do.call('rbind', list(...)) to create a single data.frame. However,
>> I've noticed as the size of the list grows large, it is perhaps better
>> to do this in chunks. As an example here's a list of 20,000 similar
>> data.frames.
>>
>> # create list of data.frames
>> dat <- vector("list", 20000)
>> for(i in seq_along(dat)) {
>>     size <- sample(1:30, 1)
>>     dat[[i]] <- data.frame(id=rep(i, size), value=rnorm(size),
>>                            letter=sample(LETTERS, size, replace=TRUE),
>>                            ind=sample(c(TRUE,FALSE), size, replace=TRUE))
>> }
>> # combine into one data.frame, normal usage
>> # system.time(do.call('rbind', dat)) # takes 2-3 minutes
>> combine <- function(x, steps=NA, verbose=FALSE) {
>>     nr <- length(x)
>>     if(is.na(steps)) steps <- nr
>>     while(nr %% steps != 0) steps <- steps+1
>>     if(verbose) cat(sprintf("step size: %s\r\n", steps))
>>     dl <- vector("list", steps)
>>     for(i in seq(steps)) {
>>         ix <- seq(from=(i-1)*nr/steps+1, length.out=nr/steps)
>>         dl[[i]] <- do.call("rbind", x[ix])
>>     }
>>     do.call("rbind", dl)
>> }
>> # combine into one data.frame
>> system.time(combine(dat, 100)) # takes 5-10 seconds
>>
>> I'm very surprised by this result. Does this improvement seem
>> reasonable? I would think "do.call" could utilize something similar by
>> default when the length of "args" is too high. Is using "do.call" not
>> recommended in this scenario?
>>
>> Regards,
>> Cole Beck
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
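
The column-wise idea discussed in the thread can be tried on a small input before timing it on the full 20,000-element list. What follows is only a minimal sketch, not the IRanges method and not a proposed patch to base R: the helper name rbind_columns is hypothetical, and it assumes every data.frame in the list has the same column names with compatible types, and that row names can be discarded.

```r
# Hypothetical helper (not from the thread): bind data.frames column by column.
# Assumes identical column names and compatible column types in every element.
rbind_columns <- function(dfs) {
  cn <- names(dfs[[1L]])
  cols <- lapply(cn, function(nm) {
    pieces <- lapply(dfs, `[[`, nm)
    if (is.factor(pieces[[1L]])) {
      # Union the levels in order of first appearance, then rebuild the factor
      levs <- unique(unlist(lapply(pieces, levels), use.names = FALSE))
      factor(unlist(lapply(pieces, as.character), use.names = FALSE),
             levels = levs)
    } else {
      # One concatenation per column instead of one rbind step per data.frame
      do.call(c, unname(pieces))
    }
  })
  names(cols) <- cn
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Small test input, similar in shape to the example data in the thread
set.seed(1)
small <- lapply(1:50, function(i) {
  size <- sample(1:5, 1)
  data.frame(id = rep(i, size), value = rnorm(size),
             letter = sample(LETTERS, size, replace = TRUE),
             stringsAsFactors = FALSE)
})

fast <- rbind_columns(small)
slow <- do.call(rbind, small)
rownames(slow) <- NULL   # the sketch ignores row names, so drop them here too
all.equal(fast, slow)
```

Comparing against do.call(rbind, ...) on a small list like this is a cheap way to confirm that the shortcut returns the same columns before relying on its speed.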



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Tue 24 Apr 2012 - 11:24:45 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 24 Apr 2012 - 13:50:49 GMT.