[Rd] Subsetting a data frame vs. subsetting the columns

From: Hadley Wickham <hadley_at_rice.edu>
Date: Wed, 28 Dec 2011 09:37:01 -0600


Hi all,

There seems to be rather a large speed disparity in subsetting when working with a whole data frame vs. working with just columns individually:

df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])

system.time(df[ord, ])
# user system elapsed
# 0.043 0.007 0.059

system.time(lapply(df, function(x) x[ord]))
# user system elapsed
# 0.022 0.008 0.029

What's going on?
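
Before comparing timings, it helps to confirm that the two calls compute the same column values. The following is a minimal sketch, not part of the original timings: it uses a smaller n so it runs quickly, and the names by_rows, by_cols, same_values and first_label are made up for illustration.

```r
# Smaller n than in the timings above so the check runs quickly
set.seed(1)
df <- as.data.frame(replicate(10, runif(1e4)))
ord <- order(df[[1]])

by_rows <- df[ord, ]                       # data frame method
by_cols <- lapply(df, function(x) x[ord])  # plain list of vectors

# Column by column, the reordered values are identical
same_values <- all(mapply(identical, as.list(by_rows), by_cols))

# The data frame result also carries the original row labels, reordered,
# so its first row label is the original index of the smallest value
first_label <- rownames(by_rows)[1]
```

So the data frame version does at least one thing the column-wise version does not: it subsets the row names as well. Whether that accounts for the timing gap is not established by this check.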

I realise this isn't quite a fair example because the second case makes a list, not a data frame, but I thought it would be a quick operation to turn a list into a data frame if you don't do any checking:

list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}
system.time(list_to_df(lapply(df, function(x) x[ord])))
# user system elapsed
# 0.031 0.017 0.048
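
As a rough check that the shortcut yields a usable data frame, the sketch below compares its result with df[ord, ]. It assumes the same list_to_df as above and a smaller n; res, ref, same_shape, same_cols and fresh_labels are illustrative names only.

```r
list_to_df <- function(list) {
  n <- length(list[[1]])
  structure(list,
    class = "data.frame",
    row.names = c(NA, -n))
}

set.seed(1)
df <- as.data.frame(replicate(10, runif(1e4)))
ord <- order(df[[1]])

res <- list_to_df(lapply(df, function(x) x[ord]))
ref <- df[ord, ]

# Same dimensions and same column contents as the data frame method
same_shape <- identical(dim(res), dim(ref))
same_cols <- all(mapply(identical, as.list(res), as.list(ref)))

# The shortcut renumbers rows 1..n instead of keeping the original labels
fresh_labels <- identical(rownames(res), as.character(seq_len(nrow(res))))
```

The two results differ only in their row names here, so the shortcut is a substitute for df[ord, ] only when the original row labels are not needed.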

So I guess this is slow because it has to make a copy of the whole data frame to modify the structure. But couldn't [.data.frame avoid that?
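
One way to probe that guess is to time the pieces separately and take a median over several runs, since single system.time() calls at this scale are noisy. This is only a sketch: median_elapsed is a hypothetical helper, the repetition count is arbitrary, and the numbers will vary by machine and R version. It separates the costs but does not by itself show whether a copy is made.

```r
# Hypothetical helper: median elapsed time of f() over a few repetitions
median_elapsed <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

set.seed(1)
df <- as.data.frame(replicate(10, runif(1e5)))
ord <- order(df[[1]])
cols <- lapply(df, function(x) x[ord])  # pre-computed, so step 3 times attributes only
n <- length(ord)

timings <- c(
  # 1. the data frame method
  whole_frame = median_elapsed(function() df[ord, ]),
  # 2. subsetting each column, result left as a list
  columns_only = median_elapsed(function() lapply(df, function(x) x[ord])),
  # 3. only the step that sets class and row.names on an existing list
  attributes_only = median_elapsed(function()
    structure(cols, class = "data.frame", row.names = c(NA, -n)))
)
print(timings)
```

If attributes_only is small relative to whole_frame, the extra time is more likely spent elsewhere than in setting the attributes.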

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 28 Dec 2011 - 15:39:33 GMT

