From: Vadim Ogranovich <vograno_at_evafunds.com>
Date: Fri 26 Nov 2004 - 10:31:07 EST


As far as I can tell data.frame class adds two features to those of lists:
* matrix-style indexing via the [i, j] and [i, j]<- operators (strictly speaking these are "["(x, i, j, ...) and "[<-"(x, i, j, ..., value), not "[,]").
* a row names attribute.
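A minimal sketch of the point above, for illustration only: it builds a small data frame (the column names a and b are arbitrary) and inspects what it carries beyond a plain list.

```r
# A small data frame for illustration; the column names are arbitrary.
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

is.list(df)              # a data frame is stored as a list of columns
names(attributes(df))    # names, class and row.names (order may vary)
row.names(df)            # the default row names, as character strings

# Dropping the class leaves a plain list that still carries row.names.
plain <- unclass(df)
class(plain)
```

The "[" method for data frames is what supplies the [i, j] behaviour; the underlying storage is the same list of columns.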

It seems that the overhead of supporting row names, both in computation and in RAM, is substantial. I frequently subset rows of data frames, i.e. use [i, ] on them, and my timings show that the equivalent list operation is about 7.5 times faster, see below.

On the other hand, at least in my usage pattern, I rarely benefit from the row names attribute, so as far as I am concerned row names are just overhead. (Of course the speed difference may be due to other factors; all I can tell is that subscripting is much slower in data frames than in lists.)

I thought of writing a new class, say lightweight.data.frame, that would be polymorphic with the existing data.frame class. The class would inherit from "list" and implement [,], [,]<- operators. It would also implement the "rownames" function that would return seq(nrow(x)), etc. It should also implement as.data.frame to avoid the overhead of conversion to a full-blown data.frame in calls like lm(y ~ x, data=myLightweightDataframe).  
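A rough sketch of what such a class might look like, as S3 methods. Everything here is hypothetical: the constructor name, the choice to support only the matrix-style x[i, j] form (not x[j] or "[<-"), and the use of seq_len() rather than seq() so that zero-row objects behave. It is an illustration of the idea, not a tested replacement for data.frame.

```r
# Hypothetical constructor: wraps a list of equal-length, named columns.
lightweight.data.frame <- function(...) {
  cols <- list(...)
  n <- unique(vapply(cols, length, integer(1)))
  stopifnot(length(n) == 1L)   # at least one column, all the same length
  structure(cols, class = c("lightweight.data.frame", "list"))
}

# dim() makes nrow() and ncol() work.
dim.lightweight.data.frame <- function(x) {
  cols <- unclass(x)
  c(length(cols[[1L]]), length(cols))
}

# Row names are generated on demand instead of being stored.
row.names.lightweight.data.frame <- function(x) {
  as.character(seq_len(nrow(x)))
}

# rownames() and colnames() go through dimnames().
dimnames.lightweight.data.frame <- function(x) {
  list(row.names(x), names(x))
}

# Matrix-style subsetting x[i, j]; the result is always of the same class
# (there is no drop = TRUE behaviour in this sketch).
"[.lightweight.data.frame" <- function(x, i, j, ...) {
  cols <- unclass(x)
  if (!missing(j)) cols <- cols[j]
  if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  structure(cols, class = class(x))
}

# Conversion to a real data frame: only attributes are added.
as.data.frame.lightweight.data.frame <- function(x, ...) {
  cols <- unclass(x)
  attr(cols, "row.names") <- .set_row_names(length(cols[[1L]]))
  class(cols) <- "data.frame"
  cols
}

# Usage with made-up data.
ldf <- lightweight.data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))
sub <- ldf[-1, ]                    # drop the first row
dim(sub)
fit <- lm(y ~ x, data = as.data.frame(ldf))
```

One design question this leaves open is how much existing code inspects the "row.names" attribute directly rather than calling row.names(); such code would not see the generated names.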

Has anyone thought of this? Can you see any potential problems?  


P.S. These are the timing results comparing data.frame operations with the equivalent list operations:

# make a 1e6 * 5 list
> system.time(x <- lapply(seq(5), function(x) rnorm(1e6)))
[1] 4.46 0.10 4.57 0.00 0.00
# convert it to a data.frame
> system.time(y <- as.data.frame(x))
[1] 49.17 1.25 50.61 0.00 0.00
# do an equivalent of x[-1,] on the list
> i <- seq(2, nrow(y)); system.time(x.sub <- lapply(x, function(x) x[i]))
[1] 0.19 0.15 0.35 0.00 0.00
# do an equivalent of y[-1,] on the data.frame
> i <- seq(2, nrow(y)); system.time(y.sub <- y[i,])
[1] 2.08 0.56 2.64 0.00 0.00
> 2.64/0.35
[1] 7.542857
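
The comparison above can also be run as a self-contained script. This is a sketch under a few assumptions that are not in the session above: a smaller n to keep the run short, column names V1..V5 added so that the conversion is unambiguous, and a check that both routes select the same values. No timings are quoted here since they depend on the machine and the R version.

```r
n <- 1e5   # smaller than the 1e6 used above, to keep the run short

# A list of 5 numeric columns, and the same data as a data frame.
x <- lapply(seq_len(5), function(k) rnorm(n))
names(x) <- paste0("V", seq_len(5))
y <- as.data.frame(x)

# Drop the first row by both routes and time each one.
i <- seq(2, n)
t.list <- system.time(x.sub <- lapply(x, function(col) col[i]))
t.df   <- system.time(y.sub <- y[i, ])

# Both routes should select the same values, column by column.
same <- all(mapply(identical, x.sub, unclass(y.sub)))

print(t.list)
print(t.df)
print(same)
```

A single system.time() call is noisy at this size, so repeating each timing several times gives a more reliable ratio.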

