Re: [Rd] data frame subset patch, take 2

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Tue 12 Dec 2006 - 17:08:01 GMT

>>>>> "Marcus" == Marcus G Daniels <mgd@santafe.edu>
>>>>>     on Tue, 12 Dec 2006 09:05:15 -0700 writes:

    Marcus> Vladimir Dergachev wrote:

    >> Here is the second iteration of data frame subset patch.
    >> It now passes make check on both 2.4.0 and 2.5.0 (svn as
    >> of a few days ago).  Same speedup as before.
    >> 

    Marcus> Hi,
    Marcus> I was wondering if this patch would make it into the
    Marcus> next release.  I don't see it in SVN, but it's hard
    Marcus> to be sure because the mailing list apparently
    Marcus> strips attachments.  If it isn't in, or going to be
    Marcus> in, is this patch available somewhere else?

I was wondering too.
      http://www.r-project.org/mail.html
explains what kind of attachments are allowed on R-devel.

I'm particularly interested, since during the last several days I've made (somewhat experimental) changes to R-devel which make some operations on large data frames that have "trivial rownames" (those represented as 1:nrow(.)) much more efficient.
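
As a small illustration of what "trivial rownames" means here, the following sketch checks whether a data frame carries automatic row names. It is not part of the measurements in this message: it assumes an R version that provides .row_names_info(), whose default (type = 1L) result is negative when the row names are automatic, and has_trivial_rownames() is a hypothetical helper name.

```r
## Sketch only: assumes .row_names_info() is available in the running R.
df_auto  <- data.frame(x = 1:5, y = 6:10)            # default, "trivial" row names
df_named <- data.frame(x = 1:5, y = 6:10,
                       row.names = letters[1:5])     # explicit character row names

## Negative result: automatic row names for that many rows.
info_auto  <- .row_names_info(df_auto)
## Positive result: row names are stored explicitly.
info_named <- .row_names_info(df_named)

## Hypothetical helper: TRUE when the row names are the automatic kind.
has_trivial_rownames <- function(d) .row_names_info(d) < 0L
has_trivial_rownames(df_auto)
has_trivial_rownames(df_named)
```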

Notably, as.matrix() of such data frames no longer produces huge row names, and e.g. dim(.) of such data frames has become lightning fast [compared to what it was].
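
A minimal sketch of the as.matrix() behaviour just described, on a tiny data frame. It assumes an R version that includes this change; the object names are for illustration only.

```r
## Sketch only: behaviour as in R versions that include the change described above.
df <- data.frame(x = c(0.5, 1.5, 2.5), y = c(1, 2, 3))   # automatic row names
m  <- as.matrix(df)

rn_auto <- rownames(m)          # NULL here: no row names are materialised
cn_auto <- colnames(m)          # column names are kept

## With explicit row names, as.matrix() still carries them over.
df2 <- df
rownames(df2) <- c("a", "b", "c")
rn_named <- rownames(as.matrix(df2))
```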

Some measurements:

N <- 1e6
set.seed(1)
## we round (for later dump().. reasons)
x <- round(rnorm(N),2)
y <- round(rnorm(N),2)
mOrig <- cbind(x = x, y = y)
df <- data.frame(x = x, y = y)
mNew <- as.matrix(df)
(sizes <- sapply(list(mOrig=mOrig, df=df, mNew=mNew), object.size))
## R-2.4.0 (64-bit):
##    mOrig       df     mNew
## 16000520 16000776 72000560

## R-2.4.1 beta (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 52000320

## R-pre-2.5.0 (32-bit):
##    mOrig       df     mNew
## 16000296 16000448 16000296

##------------------------------------

N <- 1e6
df <- data.frame(x = 0+ 1:N, y = 1+ 1:N)
system.time(for(i in 1:1000) d <- dim(df))

## R-2.4.1 beta (32-bit) [deb1]:
## [1] 1.920 3.748 7.810 0.000 0.000

## R-pre-2.5.0 (32-bit) [deb1]:
##    user  system elapsed
##   0.012   0.000   0.011

However, currently

  df[2,] ## still internally produces the character(1e6) row names!

something I think we should eliminate as well, i.e., at least make sure that only seq_len(1e6) is internally produced and not the character vector.
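
To give a rough idea of why avoiding the character vector matters, this sketch compares the storage of integer row indices with their character form. It is an illustration under assumptions, not a measurement of the internals of `[.data.frame`: N is reduced from 1e6 to keep it quick, and the object names are hypothetical.

```r
## Sketch only: N is reduced from 1e6 to keep this quick.
N <- 1e5
idx_int <- seq_len(N)                  # integer row indices
idx_chr <- as.character(idx_int)       # what character row names would look like

size_int <- as.numeric(object.size(idx_int))
size_chr <- as.numeric(object.size(idx_chr))

## Ratio of character to integer storage; the exact value depends on the
## R version and platform, so no expected number is given here.
size_chr / size_int
```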

Note however that some of these changes are backward incompatible. I do hope that the efficiency gains for such large data frames are worth some adaptation of current/old R source code.

Feedback on this topic is very welcome!

Martin



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed Dec 13 20:11:26 2006

