Re: [Rd] split() is slow on data.frame (PR#14123)

From: William Dunlap <wdunlap_at_tibco.com>
Date: Wed, 09 Dec 2009 14:26:15 -0800

Here are some differences between the current and proposed split.data.frame.

> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
+   Named=c(one=1,two=2,three=3,four=4,five=5), row.names=as.character(1001:1005))
> group<-c("A","B","A","A","B")
> split.data.frame(d,group)
$A
     Matrix.1 Matrix.2 Named
1001        1        6     1
1003        3        8     3
1004        4        9     4

$B
     Matrix.1 Matrix.2 Named
1002        2        7     2
1005        5       10     5

> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
[1] "processing data.frame"
$A
     Matrix Named
[1,]      1     1
[2,]      3     3
[3,]      4     4

$B
     Matrix Named
[1,]      2     2
[2,]      5     5
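
For comparison, the row names and both columns of Matrix survive if the grouping is applied to the row indices and each group is then pulled out with ordinary data-frame row subsetting. The following is only a minimal sketch of that idea. The helper name split_by_rows is made up for illustration and is not part of R or of the proposed patch, and the sketch makes no claim about speed.

```r
# Hypothetical helper (not part of R): split the row indices by group,
# then extract each group's rows from the data frame in one step.
split_by_rows <- function(x, f) {
  idx <- split(seq_len(nrow(x)), f)              # list of row indices per group
  lapply(idx, function(i) x[i, , drop = FALSE])  # row subsetting keeps row names
}

# Same test data as in the example above.
d <- data.frame(Matrix = I(matrix(1:10, ncol = 2)),
                Named = c(one = 1, two = 2, three = 3, four = 4, five = 5),
                row.names = as.character(1001:1005))
group <- c("A", "B", "A", "A", "B")

res <- split_by_rows(d, group)
rownames(res$A)     # original row names of group A
dim(res$A$Matrix)   # the matrix column keeps both of its columns
```

Any faster replacement for split.data.frame would need to give the same answers as this on inputs like d.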


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-devel-bounces_at_r-project.org
> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of
> pengyu.ut_at_gmail.com
> Sent: Wednesday, December 09, 2009 2:10 PM
> To: r-devel_at_stat.math.ethz.ch
> Cc: R-bugs_at_r-project.org
> Subject: [Rd] split() is slow on data.frame (PR#14123)
>
> Please see the following code for a runtime comparison between
> split() and mysplit.data.frame() (they do the same thing
> semantically). mysplit.data.frame() is a fix for split() in terms of
> performance. Could somebody include this fix (possibly with checks
> for corner cases) in a future version of R and let me know when it
> has been included?
>
> m=300000
> n=6
> k=30000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> mysplit.data.frame<-function(x,f) {
>   print('processing data.frame')
>   v=lapply(
>     1:dim(x)[[2]]
>     , function(i) {
>       split(x[,i],f)
>     }
>   )
>
>   w=lapply(
>     seq(along=v[[1]])
>     , function(i) {
>       result=do.call(
>         cbind
>         , lapply(v,
>           function(vj) {
>             vj[[i]]
>           }
>         )
>       )
>       colnames(result)=colnames(x)
>       return(result)
>     }
>   )
>   names(w)=names(v[[1]])
>   return(w)
> }
>
> system.time(split(as.data.frame(x),f))
> system.time(mysplit.data.frame(as.data.frame(x),f))
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



Received on Wed 09 Dec 2009 - 22:31:33 GMT