Re: [Rd] speeding up perception

From: Tim Hesterberg <timhesterberg_at_gmail.com>
Date: Mon, 04 Jul 2011 10:38:31 -0700

I've written a "dataframe" package that replaces existing methods for data frame creation and subscripting with versions that use less memory. For example, as.data.frame(a vector) makes 4 copies of the data in R 2.9.2, and 1 copy with the package. There is a small speed gain.

I and others have been using it at Google for some years, and it is time to either put it on CRAN, or move it into R.

R core folks - would you prefer that this be released to CRAN, or would you like to consider merging it directly into R?

I took existing functions and applied some hacks to reduce the number of times R copies objects. Some of it is ugly. This could be done more efficiently, and with cleaner code, with some changes or hooks in R's internal code, but I'm not prepared to do that.

I often use lists instead of data frames. In another package I have a 'subscriptRows' function that subscripts a list as if it were a data frame. I could merge that into the dataframe package.
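To illustrate the idea of subscripting a list as if it were a data frame, here is a minimal sketch. It is not the 'subscriptRows' function mentioned above, whose code is not shown in this message. The name subscriptRowsSketch and its behavior (each list element is treated as a column, and matrix elements are subscripted by row) are assumptions made for the example.

```r
# Hypothetical sketch, not the author's subscriptRows implementation.
# Treat each element of the list x as a column and keep only rows i,
# roughly as d[i, ] would for a data frame.
subscriptRowsSketch <- function(x, i) {
  stopifnot(is.list(x))
  lapply(x, function(col) {
    if (is.matrix(col) || is.data.frame(col)) {
      col[i, , drop = FALSE]   # two-dimensional element: subscript rows
    } else {
      col[i]                   # plain vector: subscript elements
    }
  })
}

l <- list(y = c(1.5, 2.5, 3.5, 4.5), z = c("a", "b", "c", "d"))
sub <- subscriptRowsSketch(l, c(2, 4))
str(sub)
```

Because the result stays a plain list, no row names are created or copied, which is one reason a list can be cheaper than a data frame for this kind of work.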

Memory use - number of copies made
#                                R 2.9.2   library(dataframe)
# as.data.frame(y)               4         1
# data.frame(y)                  8         3
# data.frame(y, z)               8         3
# as.data.frame(l)               10        3
# data.frame(l)                  15        5
# d$z <- z                       3,2       1,1
# d[["z"]] <- z                  4,3       2,1
# d[, "z"] <- z                  6,4,2     2,2,1
# d["z"] <- z                    6,5,2     2,2,1
# d["z"] <- list(z=z)            6,3,2     2,2,1
# d["z"] <- Z  #list(z=z)        6,2,2     2,1,1
# a <- d["y"]                    2         1
# a <- d[, "y", drop=F]          2         1
#
# y and z are vectors, Z and l are lists, and d a data frame.
# Where two numbers are given, they refer to:
#   (copies of the old data frame),
#   (copies of the new column)
# A third number refers to numbers of
#   (copies made of an integer vector of row names)
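One way to observe copies like those counted above is base R's tracemem(), which prints a message each time a traced object is duplicated. This is only a sketch: the message does not say how the counts in the table were obtained, the vector length is an arbitrary choice, and the number of messages will differ between R versions.

```r
# Sketch only: the method used for the table above is not stated in the message.
# tracemem() needs an R build with memory profiling; capabilities("profmem")
# reports whether it is available.
can_trace <- isTRUE(unname(capabilities("profmem")))

y <- runif(1e5)            # arbitrary size, chosen for illustration
z <- runif(1e5)
d <- data.frame(y = y)

if (can_trace) {
  tracemem(d)              # report duplications of the old data frame
  tracemem(z)              # report duplications of the new column
}

d$z <- z                   # each "tracemem[...]" line printed here is one copy

if (can_trace) {
  untracemem(d)
  untracemem(z)
}
```

Running the same lines with and without library(dataframe) loaded would be one way to compare the two columns of the table for a given operation.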

# ------- seconds (multiple repetitions) -------
#                       creation/column subscripting   row subscripting
# R 2.9.2            :  34.2 43.9 43.3                 10.6 13.0
# library(dataframe) :  22.5 21.8 21.8                 9.7 9.5 9.5
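Timings of this kind can be collected with system.time() around a loop of repeated calls. The sketch below is not the benchmark behind the numbers above. The data size, the repetition count, and the three operations timed are all assumptions chosen so that it runs quickly.

```r
# Hypothetical timing harness; sizes and repetition counts are assumptions,
# not the settings used for the timings reported above.
n    <- 1e4
reps <- 200
y <- runif(n)
z <- runif(n)
d <- data.frame(y = y, z = z)
i <- seq(1, n, by = 2)     # every other row

t_create <- system.time(for (k in seq_len(reps)) data.frame(y, z))
t_column <- system.time(for (k in seq_len(reps)) d["y"])
t_rows   <- system.time(for (k in seq_len(reps)) d[i, ])

elapsed <- c(creation = t_create[["elapsed"]],
             column   = t_column[["elapsed"]],
             rows     = t_rows[["elapsed"]])
print(elapsed)
```

Repeating each operation many times, as the "multiple repetitions" header suggests, keeps a single call's timer resolution from dominating the result.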

I reported one of the simpler hacks to this list earlier, and it was included in some version of R after 2.9.2, so the current version of R isn't as bad as 2.9.2.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Mon 04 Jul 2011 - 17:43:02 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 05 Jul 2011 - 06:10:06 GMT.
