From: Simon Urbanek <simon.urbanek_at_r-project.org>

Date: Sun, 03 Jul 2011 10:26:03 -0400


> I found that the performance is significantly different with different number of columns. For instance
>
> # reassign first column to 1
> example <- function(m){
>   for(i in 1:1000)
>     m[i,1] <- 1
> }
>
> m <- as.data.frame(matrix(0, ncol=2, nrow=1000))
> system.time( example(m) )
>
>    user  system elapsed
>   0.164   0.000   0.163
>
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example(m) )
>
>    user  system elapsed
>  34.634   0.004  34.765
>
> When m is a matrix, both run well under 0.1s.
>
> Increasing the number of rows (but not the number of iterations) leads to some increase in time, but not as drastic when increasing column number. Using m[[y]][x] in this case doesn't help either.
>
> example2 <- function(m){
>   for(i in 1:1000)
>     m[[1]][i] <- 1
> }
>
> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))
> system.time( example2(m) )
>
>    user  system elapsed
>  36.007   0.148  36.233
>
> r.
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
R-devel_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sun 03 Jul 2011 - 14:35:03 GMT

Robert,

it's not the handling of row names per se that causes the slowdown, but my point was that if what you need is just a matrix-like structure with different column types, you may want to use lists instead, and for equal column types you're better off with a matrix.

But to address your point, one of the reasons for subassignments on data frames being slow is that they need extra copies of the data frame for method dispatch. Data frames are lists of column vectors, so the penalty is worse with increasing number of columns. Rows play no significant (additional) role, because those are simply operations on the column vectors (they need to be copied on modification in any case).

In practice it would not matter as much unless the users do stupid things like the example loop. In that case the list holding the columns is copied twice for every single value of i, which is deadly. Obviously, the sensible thing to do, m[1:1000,1] <- 1, does not have that issue.
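As a rough sketch of that contrast, the loop and the single vectorized assignment can be put side by side. The function names `loop_assign` and `vector_assign` are made up for this illustration, and the data frame is deliberately much smaller than the 1000 x 1000 case in the thread so that it runs quickly:

```r
# Small sizes so this runs quickly; the thread used 1000 x 1000.
m <- as.data.frame(matrix(0, ncol = 10, nrow = 100))

# Element-by-element loop: one `[<-.data.frame` call per row.
# (loop_assign is a hypothetical name, not from the thread.)
loop_assign <- function(m) {
  for (i in seq_len(nrow(m)))
    m[i, 1] <- 1
  m
}

# Vectorized form: a single `[<-.data.frame` call for the whole column.
# (vector_assign is a hypothetical name, not from the thread.)
vector_assign <- function(m) {
  m[seq_len(nrow(m)), 1] <- 1
  m
}

by_loop   <- loop_assign(m)
by_vector <- vector_assign(m)

# Both approaches should fill the first column with 1 and leave the rest at 0.
all.equal(by_loop, by_vector)
```

Wrapping each call in system.time() on a wide data frame should show the gap described above, though the exact numbers will depend on the R version and machine.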

So to illustrate part of the data.frame penalty effect consider simply falling back to lists in the assignment:

> example2 <- function(m){
+   for(i in 1:1000)
+     m[[1]][i] <- 1
+ }

> m <- as.data.frame(matrix(0, ncol=1000, nrow=1000))

> system.time( example2(m) )

user system elapsed

44.359 13.608 58.011

> ### using a list is very fast as illustrated before:

> m <- as.list(as.data.frame(matrix(0, ncol=1000, nrow=1000)))

> system.time( example2(m) )

user system elapsed

0.01 0.00 0.01

> ### now try to fall back to a list for each iteration (part of what the data frames have to do):

> example3 <- function(m){

+   for(i in 1:1000) {
+     oc <- class(m)
+     class(m) <- NULL
+     m[[1]][i] <- 1
+     class(m) <- oc
+   }
+ }

> system.time( example3(m) )

user system elapsed

19.080 2.251 21.335

So just the simple fact that you unclass and re-class the object gives you half of the penalty that data.frames incur even if you're dealing with a list. Add the additional logic that data frames have to go through and you have the full picture.
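One way to see where that cost comes from is to move the unclass/re-class step out of the loop so it happens once instead of 1000 times. The following is only a sketch of that idea, not something from the thread: `example4` is a hypothetical name, and the sizes are reduced so it runs quickly.

```r
# Hypothetical variant of example3: drop the class once, loop over the
# plain list, then restore the class once at the end.
example4 <- function(m) {
  oc <- class(m)
  class(m) <- NULL              # m is now a plain list; names and row.names stay as attributes
  for (i in seq_along(m[[1]]))
    m[[1]][i] <- 1              # ordinary list/vector subassignment, no data.frame dispatch
  class(m) <- oc                # turn it back into a data frame
  m
}

m <- as.data.frame(matrix(0, ncol = 10, nrow = 100))
res <- example4(m)
is.data.frame(res)
```

Because the names and row.names attributes are untouched while the class is removed, restoring the class gives back a valid data frame.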

So, as I was saying earlier, if you want to loop subassignments over many elements: don't do that in the first place, but if you do, use lists or matrices, NOT data frames.
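For the equal-column-type case, a minimal sketch of that advice is to do the element-wise work on a matrix and convert to a data frame only once at the end, if a data frame is needed at all. The variable names here are illustrative and the sizes are kept small:

```r
# All columns are numeric, so a matrix is enough for the loop itself.
mat <- matrix(0, ncol = 10, nrow = 100)

for (i in seq_len(nrow(mat)))
  mat[i, 1] <- 1                # matrix subassignment, no data.frame method involved

# Convert once, after the loop, if downstream code expects a data frame.
df <- as.data.frame(mat)
```

With mixed column types the same pattern should work with a plain list of column vectors in place of the matrix.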

Cheers,

Simon

On Jul 3, 2011, at 8:13 AM, Robert Stojnic wrote:

> Hi Simon,
>
> On 03/07/11 05:30, Simon Urbanek wrote:

>> This is just a quick, incomplete response, but the main misconception is really the use of data.frames. If you don't use the elaborate mechanics of data frames that involve the management of row names, then they are definitely the wrong tool to use, because most of the overhead is exactly to manage to row names and you pay a substantial penalty for that. Just drop that one feature and you get timings similar to a matrix:

> I tried to find some documentation on why there needs to be extra row names handling when one is just assigning values into the column of a data frame, but couldn't find any. For a while I stared at the code of `[<-.data.frame` but couldn't figure out it myself. Can you please summarise what exactly is going one when one does m[1, 1] <- 1 where m is a data frame?
