[Rd] New class: data.table

From: Matthew Dowle <mdowle_at_concordiafunds.com>
Date: Wed 12 Apr 2006 - 14:19:10 GMT


Following previous discussion on this list (http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html) I have created a package as suggested, and uploaded it to CRAN incoming : data.table.tar.gz.

From help(data.table):

This class really does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames.

Most of the code is copied from base functions with the code manipulating row.names removed.
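
As a small illustration of the point about rownames, the base R sketch below shows that a data.frame carries a row.names attribute even when none is supplied. It uses only base functions; the column names are reused from the examples in this post, and the values are made up for illustration.

```r
# Illustrative data only; a data.frame always has row names.
DF <- data.frame(colA = 1:3, colB = 4:6)

# row.names() returns them as a character vector, even though none were given.
rn <- row.names(DF)

# The attribute list of a data.frame includes "row.names" alongside
# "names" and "class".
attr_names <- names(attributes(DF))

print(rn)
print(attr_names)
```

Recent versions of R store automatic row names in a compact form, so the cost of carrying them may differ from the timings quoted in this post.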

A data.table is identical to a data.frame other than:


nr = 1000000
D = rep(1:5,nr/5)
system.time(DF <<- data.frame(colA=D, colB=D)) # 2.08
system.time(DT <<- data.table(colA=D, colB=D)) # 0.15 (over 10 times faster to create)
identical(as.data.table(DF), DT)

object.size(DF)/object.size(DT)                 # 10 times less memory

tt = subset(DF,colA>3)
ss = DT[colA>3]
identical(as.data.table(tt), ss)


tt = with(subset(DF,colA>3),colA+colB)

ss = with(DT[colA>3],colA+colB)                 # but could be: DT[colA>3,colA+colB] (not yet implemented)
identical(tt, ss)

tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
ss = DT[tapply(1:nrow(DT),colB,last)]           # but could be: DT[last,group=colB] (not yet implemented)
identical(as.data.table(tt), ss)
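
The grouping example above relies on a function named last, which this post does not define. The sketch below shows the data.frame side of that example in base R, assuming last simply returns the final element of a vector. The helper and the sample data are assumptions made for illustration.

```r
# Assumed helper: the post does not define `last`, so this is a guess at its intent.
last <- function(x) x[length(x)]

# Made-up sample data with a grouping column colB.
DF <- data.frame(colA = c(1, 2, 3, 4, 5, 6),
                 colB = c("a", "b", "a", "b", "c", "a"))

# For each group in colB, find the index of its last row ...
idx <- tapply(seq_len(nrow(DF)), DF$colB, last)

# ... then select those rows (as.vector drops the array structure from tapply).
res <- DF[as.vector(idx), ]
print(res)
```

Here the groups "a", "b" and "c" end at rows 6, 4 and 5, so those three rows are selected, in group order.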

tt = DF[with(DF,colA %in% Lkp),]
ss = DT[colA %in% Lkp]                          # expressions inside the [] can see objects in the calling frame
identical(as.data.table(tt), ss)

In each case above there is either a space, time, or code brevity advantage with the data.table.
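
The remark that expressions inside the [] can see objects in the calling frame can be sketched in base R with eval() and substitute(), the same mechanism subset() uses. This is not the package's actual code: filter_rows is a hypothetical helper, and the value of Lkp is assumed because the post does not define it.

```r
# Hypothetical helper, not part of any package: keep the rows of `x` for which
# the unevaluated expression `expr` is TRUE.
filter_rows <- function(x, expr) {
  e <- substitute(expr)                  # capture the expression unevaluated
  # Look up names in the columns of x first, then in the caller's frame.
  keep <- eval(e, x, parent.frame())
  x[keep, , drop = FALSE]
}

DF <- data.frame(colA = rep(1:5, 2), colB = rep(1:5, 2))
Lkp <- 1:3                               # assumed value; lives in the calling frame

# colA is found in DF, Lkp is found in the calling frame.
res <- filter_rows(DF, colA %in% Lkp)
print(res)
```

Passing parent.frame() as the enclosure is what lets Lkp resolve without being a column of the table.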

The motivation for the new class grew from the realization that the performance of data.frames can be improved by removing the rownames. See the thread linked at the top of this message for the previous discussion.


R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu Apr 13 00:49:50 2006
