Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

From: Matthew Dowle <mdowle_at_mdowle.plus.com>
Date: Tue, 01 May 2012 09:26:31 +0000

Antonio Piccolboni <antonio <at> piccolboni.info> writes:
> Hi,
> I was wondering if there is anything more efficient than split to do the
> kind of conversion in the subject. If I create a data frame as in
>
> system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
> 1:2000, sep =""))})
> user system elapsed
> 0.004 0.000 0.004
>
> and then I try to split it
>
> > system.time(split(fd, 1:nrow(fd)))
> user system elapsed
> 0.333 0.031 0.415
>
> You will be quick to notice the roughly two orders of magnitude difference
> in time between creation and conversion. Granted, it's not written anywhere
> that they should be similar but the latter seems interpreter-slow to me
> (split is implemented with an lapply in the data frame case). There is also a
> memory issue when I hit about 20000 elements (allocating 3GB when
> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> of approaching the bare metal and for the sake of getting things done, I
> thought I would ask the experts. Thanks
>
> Antonio

Perhaps r-help or Stack Overflow would have been more appropriate to try first, before r-devel. If you did, please say so.

Answering anyway. Do you really want to split every single row? What's the bigger picture? Perhaps you don't need to split at all.

On the off chance that the example was just for exposition, and applying some (biased) guesswork, have you seen the data.table package? It doesn't use the split-apply-combine paradigm because, as your (extreme) example shows, that doesn't scale. When you use the 'by' argument of [.data.table, it allocates memory once for the largest group. Then it reuses that same memory for each group. That's one reason it's fast and memory efficient at grouping (an order of magnitude faster than tapply).
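To make that concrete, here is a minimal sketch contrasting split-based grouping with data.table's 'by'. The column names, group sizes, and the mean(y) payload are illustrative, not taken from the original post:

```r
# Minimal sketch: split-apply-combine vs data.table grouping.
# Column/group names and the mean(y) payload are illustrative only.
library(data.table)

n  <- 2000
DT <- data.table(x  = 1:n,
                 y  = rnorm(n),
                 id = rep(paste0("g", 1:20), each = n / 20))

# split-apply-combine: materialises one piece per group
res.split <- sapply(split(DT$y, DT$id), mean)

# data.table 'by': computes per group in one pass,
# reusing the same working memory across groups
res.dt <- DT[, mean(y), by = id]
```

The results agree; the difference is that split allocates a separate object for every group, while data.table allocates once for the largest group.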

Independent timings:
http://www.r-bloggers.com/comparison-of-ave-ddply-and-data-table/

If you really do want to split every single row, then

    DT[,<something>,by=1:nrow(DT)]

will give perhaps two orders of magnitude of speedup, but that's an unfair example because it isn't very realistic. Scaling depends both on the size of the data.frame and on how much you want to split it up. Your example is extreme in the latter but not the former; data.table scales in both.
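For instance, using your example sizes (sum(y) below is only a stand-in payload, since the real j-expression was left unspecified):

```r
# Hypothetical illustration of row-wise grouping with data.table;
# sum(y) is a stand-in payload, not from the original post.
library(data.table)

n  <- 2000
fd <- data.frame(x = 1:n, y = rnorm(n), id = paste0("x", 1:n))
DT <- as.data.table(fd)

t.split <- system.time(split(fd, 1:nrow(fd)))             # lapply-based, one copy per row
t.dt    <- system.time(res <- DT[, sum(y), by = 1:nrow(DT)])  # single grouped pass
```

With one row per group, each per-group sum(y) is just that row's y, so the result has one row per input row; the interesting part is the relative timings on your machine.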

It has nothing to do with the interpreter, btw; it's just memory usage.

Matthew



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Tue 01 May 2012 - 09:29:09 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 01 May 2012 - 14:00:52 GMT.
