Re: [Rd] row.names in data.frame

From: Don MacQueen
Date: Mon 17 Apr 2006 - 15:25:41 GMT

This looks like a good proposal to me, from an end-user's point of view.

I have, from time to time, wished I could set row names to NULL. Not for performance reasons, but because some aspect of my data, in combination with how R handles row names, was requiring me to explicitly manage them in situations where I was otherwise making no use of them. Admittedly, some of these occasions were quite a few R versions ago, when row names were not as carefully managed by R itself as they are now.

Potential ramifications are not immediately obvious to me, but for example, will rbind() of two data frames, both of which have been assigned NULL row names, result in a data frame with NULL row names? (Would it matter?) What about one with NULL row names and one with non-NULL row names?
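
One way to probe these rbind() cases, once a build with the proposed representation is available, is sketched below. It assumes that assigning NULL through row.names<-() is accepted, as point 4 of the proposal quoted below suggests. The names df_a, df_b and df_named are made up for illustration, and no particular result is claimed.

```r
# Two data frames whose row names are reset to NULL (hypothetical example data).
# Assumption: row.names<-() accepts NULL, as the quoted proposal suggests.
df_a <- data.frame(x = 1:2)
df_b <- data.frame(x = 3:4)
row.names(df_a) <- NULL
row.names(df_b) <- NULL

# A third data frame with explicit, non-NULL row names.
df_named <- data.frame(x = 5:6, row.names = c("a", "b"))

both_null <- rbind(df_a, df_b)      # NULL + NULL
mixed     <- rbind(df_a, df_named)  # NULL + non-NULL

# Inspect the results through the accessor and through the raw attribute.
print(row.names(both_null))
print(attr(both_null, "row.names"))
print(row.names(mixed))
print(attr(mixed, "row.names"))
```

Comparing the accessor output with the raw attribute shows whether the combined data frame kept a non-character representation or fell back to character row names.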


At 8:29 PM +0100 4/14/06, Prof Brian Ripley wrote:
>We know from the White Book p.57 that the row names of a data frame `are
>never NULL and must be unique'. R documents that row.names() returns a
>character vector, and in R (much more so than on S) a long character
>vector of short unique strings is expensive to store (I saw 72 bytes/row
>on a 64-bit machine for 1:1e6). [Incidentally, in the White Book the
>index page nos are all off by one for this item, and commonly elsewhere.
>It seems to be LaTeX indexing the page on which a para finishes.]
>Last time this came up Martin Maechler asked if we could not do it more
>efficiently, and reminded us recently. It would be fairly easy if
>everyone used the row.names() and row.names<-() accessor functions, but
>some packages (notably Design and Hmisc) access the attribute "row.names"
>directly (and what that is seems to be undocumented).
>I noticed that the White Book does not appear to say that the row names
>are character, and indeed says
> 'If all else fails the row names are just the row numbers.'
>and it seems the author of expand.grid() took that literally, for it used
>to assign integers to the row names. However, the current S-PLUS help for
>both row.names and data.frame say row names are a character vector (and
>that row.names<-() coerces to character).
>We can certainly differentiate between the internal representation and the
>result of row.names(). Here is my idea:
>1) The internal representation is either NULL, an integer vector or a
>character vector.
>2) attr(x, "row.names") will always return either an integer vector or a
>character vector, using 1:nrow(x) if the internal representation is NULL.
>3) row.names() will always return as.character(attr(x, "row.names")).
>4) attr<- and row.names<- can set NULL, integer or character.
>5) Row-indexing a data frame with NULL or integer representation will give
>an integer representation.
>This would appear to be completely back-compatible for those who only work
>via the accessor functions, and probably work with almost all package code
>that manipulates attributes directly. Since the changes can be done
>almost entirely in C code, the performance hit should be negligible.
>The benefits will probably only be appreciable with `tall and skinny'
>data frames, as even 72 bytes per row is only going to buy you 9 numeric
>columns. But that is, it seems, a common enough case to make this
>worthwhile.
>This would be a change aimed at 2.4.0, since we would need plenty of time
>both for testing and to alter code to make use of the more efficient
>representation.
>BTW, the maximum object length of 2^31 - 1 ensures that an integer
>representation of row numbers suffices.
>Brian D. Ripley,
>Professor of Applied Statistics,
>University of Oxford, Tel: +44 1865 272861 (self)
>1 South Parks Road, +44 1865 272866 (PA)
>Oxford OX1 3TG, UK Fax: +44 1865 272595
> mailing list
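
As a rough illustration of points 2, 3 and 5 in the quoted proposal, and of the storage cost that motivates it, the sketch below can be run in a version of R that stores default row names compactly. That is an assumption, and this is not code from either message. The variable names are made up, and the sizes reported by object.size() depend on platform and R version, so none are shown here.

```r
# A data frame created without explicit row names (hypothetical example data).
df <- data.frame(x = 1:5)

rn_attr <- attr(df, "row.names")   # point 2: integer or character vector
rn_chr  <- row.names(df)           # point 3: always a character vector

print(is.integer(rn_attr))
print(identical(rn_chr, as.character(rn_attr)))

# Point 5: row-indexing a data frame that has no explicit row names.
sub <- df[2:4, , drop = FALSE]
print(attr(sub, "row.names"))

# Storage cost: row numbers kept as integers versus as character strings.
n <- 1e5
int_size <- as.numeric(object.size(seq_len(n)))
chr_size <- as.numeric(object.size(as.character(seq_len(n))))
print(c(bytes_per_row_integer   = int_size / n,
        bytes_per_row_character = chr_size / n))
```

The per-row figures give a quick check of the "tall and skinny" argument on a given machine: the larger the gap between the two, the more an integer representation saves.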

Don MacQueen
Lawrence Livermore National Laboratory
Livermore, CA, USA

Received on Tue Apr 18 01:27:47 2006
