Re: [Rd] split() is slow on data.frame (PR#14123)

From: Peng Yu <pengyu.ut_at_gmail.com>
Date: Wed, 09 Dec 2009 21:30:23 -0600

I made a version for matrices, because it is more efficient to split each column of a matrix directly than to convert the matrix to a data.frame and then call split() on the data.frame. Note that the matrix version differs slightly from the data.frame version. Could somebody add this to R as well?

split.matrix<-function(x,f) {
 # split every column on its own; split.default works on plain vectors
 v=lapply(
     seq_len(ncol(x))
     , function(i) {
       base::split.default(x[,i],f)#the difference is here
     }
     )

 # for each group, bind the per-column pieces back into a matrix
 w=lapply(
     seq_along(v[[1]])
     , function(i) {
       result=do.call(
           cbind
           , lapply(v,
               function(vj) {
                 vj[[i]]
               }
               )
           )
       colnames(result)=colnames(x)
       return(result)
     }
     )
 names(w)=names(v[[1]])
 return(w)
}
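
For comparison, here is a minimal sketch of a different way to get the same kind of result: split the row indices once and subset the matrix for each group, instead of splitting column by column. The helper name split_matrix_by_index is hypothetical and not part of base R, and the small test matrix is made up for illustration. I have not timed it against the version above.

```r
# Hypothetical helper (not in base R): split the rows of a matrix by f.
# The row indices are split once, then each group is taken as a row subset.
split_matrix_by_index <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)
  # drop = FALSE keeps every piece a matrix, even for one-row groups
  lapply(idx, function(i) x[i, , drop = FALSE])
}

# Small made-up example: 5 rows, 2 named columns, 2 groups
x <- matrix(1:10, ncol = 2, dimnames = list(NULL, c("a", "b")))
f <- c("A", "B", "A", "A", "B")
res <- split_matrix_by_index(x, f)
print(res)
```

Because the subsetting is done on whole rows, column names (and row names, if any) carry over without a separate colnames() assignment.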

On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cberry_at_tajo.ucsd.edu> wrote:

> On Wed, 9 Dec 2009, William Dunlap wrote:
>
>> Here are some differences between the current and proposed
>> split.data.frame.
>
> Adding 'drop=FALSE' fixes this case. See the inline correction below.
>
> Chuck
>
>>
>> > d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
>>     Named=c(one=1,two=2,three=3,four=4,five=5),
>>     row.names=as.character(1001:1005))
>> > group<-c("A","B","A","A","B")
>> > split.data.frame(d,group)
>> $A
>>    Matrix.1 Matrix.2 Named
>> 1001        1        6     1
>> 1003        3        8     3
>> 1004        4        9     4
>>
>> $B
>>    Matrix.1 Matrix.2 Named
>> 1002        2        7     2
>> 1005        5       10     5
>>
>> > mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
>> [1] "processing data.frame"
>> $A
>>    Matrix Named
>> [1,]      1     1
>> [2,]      3     3
>> [3,]      4     4
>>
>> $B
>>    Matrix Named
>> [1,]      2     2
>> [2,]      5     5
>>
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-devel-bounces_at_r-project.org
>>> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of
>>> pengyu.ut_at_gmail.com
>>> Sent: Wednesday, December 09, 2009 2:10 PM
>>> To: r-devel_at_stat.math.ethz.ch
>>> Cc: R-bugs_at_r-project.org
>>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>>
>>> Please see the following code for a runtime comparison between
>>> split() and mysplit.data.frame() (they do the same thing
>>> semantically). mysplit.data.frame() is a fix for split() in terms of
>>> performance. Could somebody include this fix (with possible checks
>>> for corner cases) in a future version of R and let me know when the
>>> fix is included?
>>>
>>> m=300000
>>> n=6
>>> k=30000
>>>
>>> set.seed(0)
>>> x=replicate(n,rnorm(m))
>>> f=sample(1:k, size=m, replace=TRUE)
>>>
>>> mysplit.data.frame<-function(x,f) {
>>>  print('processing data.frame')
>>>  v=lapply(
>>>      1:dim(x)[[2]]
>>>      , function(i) {
>>>        split(x[,i],f)
>
> Change to:
>
>         split(x[,i,drop=FALSE],f)
>
>
>>>      }
>>>      )
>>>
>>>  w=lapply(
>>>      seq(along=v[[1]])
>>>      , function(i) {
>>>        result=do.call(
>>>            cbind
>>>            , lapply(v,
>>>                function(vj) {
>>>                  vj[[i]]
>>>                }
>>>                )
>>>            )
>>>        colnames(result)=colnames(x)
>>>        return(result)
>>>      }
>>>      )
>>>  names(w)=names(v[[1]])
>>>  return(w)
>>> }
>>>
>>> system.time(split(as.data.frame(x),f))
>>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>>
>>> ______________________________________________
>>> R-devel_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cberry_at_tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>
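
To illustrate the drop=FALSE point on Bill's example data, here is a small sketch. It only shows what single-column extraction from a data.frame returns with and without drop=FALSE, and that splitting the one-column data.frame keeps the row names. It is not a full replacement for mysplit.data.frame(), and it says nothing about speed.

```r
# Example data taken from Bill Dunlap's message quoted above
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

# Without drop=FALSE the column comes back as a bare object,
# no longer wrapped in a data.frame that carries the row names
plain <- d[, 1]
print(is.data.frame(plain))

# With drop=FALSE the result stays a one-column data.frame
kept <- d[, 1, drop = FALSE]
print(is.data.frame(kept))
print(rownames(kept))

# Splitting the one-column data.frame keeps row names in every piece
pieces <- split(kept, group)
print(rownames(pieces$A))
print(rownames(pieces$B))
```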

Received on Thu 10 Dec 2009 - 03:33:00 GMT