[Rd] split() is slow on data.frame (PR#14123)

From: <pengyu.ut_at_gmail.com>
Date: Wed, 09 Dec 2009 23:10:09 +0100 (CET)


Please see the following code for a runtime comparison between split() and mysplit.data.frame() (they do the same thing semantically). mysplit.data.frame() is a fix for split() in terms of performance. Could somebody include this fix (with possible checks for corner cases) in a future version of R and let me know when it has been included?

m=300000
n=6
k=30000

set.seed(0)
x=replicate(n,rnorm(m))
f=sample(1:k, size=m, replace=TRUE)

mysplit.data.frame <- function(x, f) {
  print('processing data.frame')
  # Split each column separately; splitting plain vectors is fast.
  v <- lapply(
    seq_len(ncol(x)),
    function(i) {
      split(x[, i], f)
    }
  )
  # For each group, bind the per-column pieces back together.
  w <- lapply(
    seq_along(v[[1]]),
    function(i) {
      result <- do.call(cbind, lapply(v, function(vj) vj[[i]]))
      colnames(result) <- colnames(x)
      return(result)
    }
  )
  names(w) <- names(v[[1]])
  return(w)
}

system.time(split(as.data.frame(x),f))
system.time(mysplit.data.frame(as.data.frame(x),f))
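
Before relying on the timings, it may help to confirm on a small input that both approaches yield the same groups and values. The sketch below is only an illustration: the helper name split_by_column and the small test data are assumptions made here, not part of the report. Note that the column-wise approach builds each group with cbind(), so it returns matrices rather than data frames; the comparison therefore converts the split() pieces with as.matrix() and ignores row and column names.

```r
# Hypothetical helper (name chosen for this sketch): same column-wise idea
# as in the report, written to be self-contained.
split_by_column <- function(x, f) {
  # Split each column as a plain vector.
  v <- lapply(seq_len(ncol(x)), function(i) split(x[[i]], f))
  # Reassemble the i-th piece of every column into one matrix per group.
  w <- lapply(seq_along(v[[1]]), function(i) {
    result <- do.call(cbind, lapply(v, function(vj) vj[[i]]))
    colnames(result) <- colnames(x)
    result
  })
  names(w) <- names(v[[1]])
  w
}

# Small illustrative data set (sizes are arbitrary assumptions).
set.seed(1)
d <- as.data.frame(replicate(3, rnorm(20)))
g <- sample(1:4, size = 20, replace = TRUE)

a <- split(d, g)            # list of data frames
b <- split_by_column(d, g)  # list of matrices

# Compare group by group, ignoring dimnames.
same_values <- all(mapply(function(p, q) {
  isTRUE(all.equal(unname(as.matrix(p)), unname(q)))
}, a, b))
same_names <- identical(names(a), names(b))
```

If same_values and same_names are both TRUE, the two results agree on this input up to the data.frame versus matrix distinction. A drop-in replacement would presumably also need to handle non-numeric columns and return data frames, which this sketch does not attempt.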



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 09 Dec 2009 - 22:13:37 GMT