Re: [Rd] split() is slow on data.frame (PR#14123)

From: Charles C. Berry <cberry_at_tajo.ucsd.edu>
Date: Wed, 09 Dec 2009 15:44:29 -0800

On Wed, 9 Dec 2009, William Dunlap wrote:

> Here are some differences between the current and proposed
> split.data.frame.

Adding 'drop=FALSE' fixes this case. See the inline correction below.
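
As a rough illustration of why the argument matters, the sketch below uses Bill's example data frame from the quoted message. The names col_dropped and col_kept are made up for this example only.

```r
# Bill's example data: a matrix column plus explicit row names
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))

# Default behaviour: selecting one column drops the data frame wrapper
col_dropped <- d[, 1]

# With drop = FALSE the result stays a one-column data frame,
# so the row names and the full matrix column travel with it
col_kept <- d[, 1, drop = FALSE]

is.data.frame(col_dropped)   # FALSE
is.data.frame(col_kept)      # TRUE
rownames(col_kept)           # "1001" "1002" "1003" "1004" "1005"
```

With the data frame wrapper kept, split() sees a data frame rather than a bare column, which is what lets the row names survive the split.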

Chuck

>
>> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
> Named=c(one=1,two=2,three=3,four=4,five=5),
> row.names=as.character(1001:1005))
>> group<-c("A","B","A","A","B")
>> split.data.frame(d,group)
> $A
>      Matrix.1 Matrix.2 Named
> 1001        1        6     1
> 1003        3        8     3
> 1004        4        9     4
>
> $B
>      Matrix.1 Matrix.2 Named
> 1002        2        7     2
> 1005        5       10     5
>
>> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
> [1] "processing data.frame"
> $A
>      Matrix Named
> [1,]      1     1
> [2,]      3     3
> [3,]      4     4
>
> $B
>      Matrix Named
> [1,]      2     2
> [2,]      5     5
>
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>> -----Original Message-----
>> From: r-devel-bounces_at_r-project.org
>> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of
>> pengyu.ut_at_gmail.com
>> Sent: Wednesday, December 09, 2009 2:10 PM
>> To: r-devel_at_stat.math.ethz.ch
>> Cc: R-bugs_at_r-project.org
>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>
>> Please see the following code for the runtime comparison between
>> split() and mysplit.data.frame() (they do the same thing
>> semantically). mysplit.data.frame() is a fix of split() in terms of
>> performance. Could somebody include this fix (with possible checking
>> for corner cases) in a future version of R and let me know when the
>> fix is included?
>>
>> m=300000
>> n=6
>> k=30000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=T)
>>
>> mysplit.data.frame<-function(x,f) {
>>   print('processing data.frame')
>>   v=lapply(
>>     1:dim(x)[[2]]
>>     , function(i) {
>>       split(x[,i],f)

Change to:

          split(x[,i,drop=FALSE],f)

>>     }
>>   )
>>
>>   w=lapply(
>>     seq(along=v[[1]])
>>     , function(i) {
>>       result=do.call(
>>         cbind
>>         , lapply(v,
>>           function(vj) {
>>             vj[[i]]
>>           }
>>         )
>>       )
>>       colnames(result)=colnames(x)
>>       return(result)
>>     }
>>   )
>>   names(w)=names(v[[1]])
>>   return(w)
>> }
>>
>> system.time(split(as.data.frame(x),f))
>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
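
Putting the inline correction together with the quoted code, the patched function might look roughly like the sketch below. The name mysplit_df is hypothetical (chosen to avoid clashing with the quoted mysplit.data.frame), the print() call is dropped, and the sketch has not been benchmarked, so treat it as illustrative only.

```r
# Hypothetical name; a sketch of the quoted function with drop = FALSE applied
mysplit_df <- function(x, f) {
  # Split each column on its own, keeping it a one-column data frame
  v <- lapply(seq_len(ncol(x)), function(i) split(x[, i, drop = FALSE], f))

  # For each group, bind the per-column pieces back together
  w <- lapply(seq_along(v[[1]]), function(i) {
    result <- do.call(cbind, lapply(v, function(vj) vj[[i]]))
    colnames(result) <- colnames(x)
    result
  })
  names(w) <- names(v[[1]])
  w
}

# Bill's example data and grouping vector
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

res <- mysplit_df(d, group)
rownames(res$A)   # row names are kept: "1001" "1003" "1004"
```

Note that x[, i, drop = FALSE] is itself a data frame, so each per-column split() call dispatches to split.data.frame again. The corrected version may therefore give back some of the speed-up and would need to be re-timed against the original comparison.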

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry_at_tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

Received on Wed 09 Dec 2009 - 23:51:41 GMT