Re: [Rd] colnames slow (PR#10470)

From: <maechler_at_stat.math.ethz.ch>
Date: Wed, 28 Nov 2007 15:55:14 +0100 (CET)


>>>>> "UweL" == Uwe Ligges <ligges_at_statistik.uni-dortmund.de>
>>>>>     on Mon, 26 Nov 2007 22:14:07 +0100 writes:

    UweL> tomas.larsson_at_gm.com wrote:
>> Full_Name: Tomas Larsson
>> Version: 2.6.0
>> OS: Windows XP
>> Submission from: (NULL) (198.208.251.24)
>>
>>
>> This is not a bug, it is a performance issue but I think it should have an easy
>> fix.
>>
>> I have a large matrix (about 2,000,000 by 20), when I type colnames(x) it takes
>> a long time to get the result. However, if I select just the first couple of
>> rows of the matrix I don't have to wait for the result. See below for example.

>> > system.time(colnames(x))
>>    user  system elapsed
>>    9.98    0.00   10.00
>> > system.time(colnames(x[1:2,]))
>>    user  system elapsed
>>    0.01    0.00    0.02

    UweL> Documentation in the released version of R (2.6.1) tells us:

    UweL> For a data frame, 'rownames'
    UweL> and 'colnames' are calls to 'row.names' and 'names' respectively,
    UweL> but the latter are preferred.

aaah, so we do have something close to a bug, since the above is only correct if "are" is interpreted in quite a wide sense:

Both colnames() and rownames() call dimnames(), which is a (.Primitive) generic, and the data.frame method simply is

    function (x) list(row.names(x), names(x))

So, in fact, both colnames() and rownames() each call *both* row.names() and names() even though only one of them is needed.

This is indeed suboptimal in the case where colnames() is of length 20 and rownames() is of length 2'000'000 ... not really an atypical case.

And that (last paragraph) is also true when 'x' is a matrix and not a data.frame. However, there's a bit more to it...
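To make the dispatch described above concrete, here is a minimal sketch on a tiny data frame. The data and the object names (df, dn) are made up for illustration. It only checks that dimnames() on a data frame returns both components and that colnames() and names() agree on the result; it does not measure anything.

```r
## Small data frame with explicit row names (hypothetical example data)
df <- data.frame(a = 1:3, b = letters[1:3],
                 row.names = c("r1", "r2", "r3"))

## The data.frame method of dimnames() builds *both* components
dn <- dimnames(df)
str(dn)

## colnames()/rownames() give the same answers as names()/row.names()
stopifnot(identical(dn[[1]], row.names(df)),
          identical(dn[[2]], names(df)),
          identical(colnames(df), names(df)),
          identical(rownames(df), row.names(df)))
```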

    UweL> and on my machine I get:

    UweL> system.time(names(x))
    UweL> user  system elapsed
    UweL> 0       0       0

yes. But what if his 'x' really *was* a matrix (and not a data frame)?

The speed of colnames(x) in such a case depends quite a bit on whether the matrix has non-NULL rownames. Ideally I think it should not, and hence partly agree with Tomas.

HOWEVER, there's more to it.
If 'x' *was* a matrix (and this proves to me that it was not in Tomas' case), then even though colnames() seems like a waste (of memory, copying), it is in fact still very fast in newer versions of R, most probably because 'character' vectors are hashed now and much less memory allocation is happening than in earlier versions of R.

The only case that is slow is for a *data frame* with "empty" i.e. automatic rownames. Watch this :

m <- matrix(pi, 2e6, 20,
            dimnames=list(LETTERS[sample(26,2e6,replace=TRUE)], letters[1:20]))
system.time(for(i in 1:100) cc <- colnames(m)) ## 0.001 -- very fast
## ditto for this:
system.time(for(i in 1:100) dd <- dimnames(m))

## HOWEVER:
system.time(dm <- as.data.frame(m)) ## takes more than a second
##    user  system elapsed
##   2.462   1.379   3.842

## Quite a bit slower (x 1000 !) than for the matrix above, but still ok:
system.time(for(i in 1:100) c2 <- colnames(dm))
##    user  system elapsed
##   1.202   0.638   1.842
stopifnot(identical(c2, cc))

## ditto
system.time(for(i in 1:100) d2 <- dimnames(dm))
##    user  system elapsed
##   1.143   0.626   1.769
stopifnot(identical(d2, dd))

###---- BUT now: What happens if we have "empty" rownames ???

## m0 := {m with empty rownames} :
m0 <- m
dimnames(m0) <- list(NULL, colnames(m0))

## and ditto for the data frames:
## dm0 := {dm with empty rownames, i.e. "internal/automatic" 1:N rownames}:
system.time(dm0 <- as.data.frame(m0))
##    user  system elapsed
##   1.677   1.241   2.922

system.time(c3 <- colnames(dm0))
## user system elapsed
## 5.208 0.047 5.261

###---> OOOPS! One single call to colnames(.)
###     needs more than 100 calls in the non-empty rownames case

## repeated calls become faster ..... and ....
system.time(c3 <- colnames(dm0))
##    user  system elapsed
##   3.109   0.000   3.110
## ..... faster  ......and even much faster
system.time(c3 <- colnames(dm0))
## user system elapsed
## 0.913 0.007 0.922

## Note: repeated calls to dimnames(.) here become faster :
system.time(d3 <- dimnames(dm0))

## Note indeed, that names() is lightning fast in comparison:
system.time(for(i in 1:100) c4 <- names(dm0)) ## is 'immediate' (0 sec)
##    user  system elapsed
##   0.001   0.000   0.000  --- 100 x ~1000 times faster
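One possible reading of the slow case, which the timings above do not spell out, is that automatic row names are kept in a compact internal form and row.names() has to expand them into a full character vector on every call. That explanation is an assumption on my part, not something measured here. The small sketch below (hypothetical 5 x 2 data) only shows how to tell the two kinds of row names apart with .row_names_info(), and that names() never touches the row names at all.

```r
## Tiny analogue of 'dm0': a data frame built from a matrix without rownames
m0  <- matrix(pi, 5, 2, dimnames = list(NULL, c("a", "b")))
dm0 <- as.data.frame(m0)

## Negative value: row names are "automatic"; positive: stored explicitly
n_info <- .row_names_info(dm0)

## row.names() nevertheless returns a character vector of full length
rn <- row.names(dm0)
stopifnot(n_info < 0, is.character(rn), length(rn) == nrow(dm0))

## names() only reads the "names" attribute
stopifnot(identical(names(dm0), c("a", "b")))
```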


All things considered, I'd currently propose to add

    if(is.data.frame(x) && do.NULL)
        return(names(x))

to the beginning of 'colnames'.
We have such a clause already at the beginning of 'colnames<-'
.... all of which would suggest making these generic, but we have been there before and consciously decided against doing so, on the grounds that 'dimnames' is already generic and rownames(.) and colnames(.) should really be equivalent to dimnames(.)[[j]] for j=1 or 2, respectively.
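As a rough sketch of what the proposed shortcut amounts to, without touching base R: colnames2() below is a hypothetical wrapper name, and its argument list simply mirrors colnames(x, do.NULL, prefix). It is not the actual patch, only an illustration that the early return gives the same answer for data frames and leaves matrices on the existing dimnames()-based path.

```r
## colnames2() is a hypothetical stand-in, not the base R source
colnames2 <- function(x, do.NULL = TRUE, prefix = "col") {
    ## proposed shortcut: for data frames, names(x) already is the answer
    if (is.data.frame(x) && do.NULL)
        return(names(x))
    ## everything else goes through the existing colnames() code
    colnames(x, do.NULL = do.NULL, prefix = prefix)
}

df  <- data.frame(a = 1:3, b = 4:6)
mat <- matrix(0, 2, 2, dimnames = list(NULL, c("u", "v")))
stopifnot(identical(colnames2(df),  colnames(df)),
          identical(colnames2(mat), colnames(mat)))
```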

Martin Maechler, ETH Zurich



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 28 Nov 2007 - 15:01:29 GMT
