Re: [Rd] (PR#8192) [ subscripting sometimes loses names

From: Tim Hesterberg <TimHesterberg_at_gmail.com>
Date: Sun, 01 Feb 2009 09:25:50 -0800

>...
>Simon, no, the drop=FALSE argument has nothing to do with what
>Christian was talking about. The kind of thing he meant is PR# 8192,
>"Subject: [ subscripting sometimes loses names":
>
> http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
>In R, subscripting with "[" USUALLY retains names, but R has various
>edge cases where it (IMNSHO) inappropriately discards them. This
>occurs with both .Primitive("[") and "[.data.frame". This has been
>known for years, but I have not yet tried digging into R's
>implementation to see where and how the names are actually getting
>lost.
>
>Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
>in 2001 show similar buggy edge case behavior. Older versions of
>S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
>behavior. I presume that the original Bell Labs S had correct
>name-preserving behavior, and then the S-Plus developers broke it
>sometime along the way.

(Later comments on the thread pointed out the difference between x[,1] for matrices and data frames.)

I rewrote the S-PLUS data frame code around then, to fix various inconsistencies and improve efficiency. This was probably my change, and I would do it again.

Note that the components of a data frame do not have names attached to them; the row names are a separate object. Extracting a component vector or matrix from a data frame should not attach names to the result, because of:
* memory (attaching row names to an object can more than double the size of the object),
* speed,
* some objects cannot take names, and attaching them could change the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless, artifacts of an early design decision that all data frames have row names.
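
A minimal R sketch of the first points (not part of the original message; the data and variable names are made up for illustration, and the behavior shown is that of current R rather than S-PLUS). It compares what comes back from `[` on a data frame and on the equivalent matrix, then gives a rough idea of how much room names can take on a long vector:

```r
# Illustrative data only; all names here are hypothetical.
df <- data.frame(x = c(1.5, 2.5, 3.5), row.names = c("a", "b", "c"))
m  <- as.matrix(df)

col_df <- df[, 1]   # data frame column: row names are not attached
col_m  <- m[, 1]    # matrix column: row names come along as names

print(names(col_df))
print(names(col_m))

# Rough size comparison for a longer vector with and without names.
n     <- 1e5
plain <- numeric(n)
named <- plain
names(named) <- paste0("row", seq_len(n))

print(object.size(plain))
print(object.size(named))
```

The exact sizes depend on the R version and platform, so no figures are quoted here; the point is only that the names can take more room than the data.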

Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects (numeric, etc.), whereas components of data frames can be nearly arbitrary objects, and
* row names get added to a data frame whether a user wants them or not, whereas row names on a matrix have to be specified.
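
Both differences can be seen in a few lines of R. This is only a sketch with made-up data (the column names `a`, `b`, `when` and `grp` are hypothetical), run against current R:

```r
# A matrix has no row names unless they are specified.
m <- matrix(1:6, nrow = 3)
print(rownames(m))

# A data frame gets row names whether or not any were asked for.
df <- data.frame(a = 1:3, b = 4:6)
print(rownames(df))

# Components of a data frame need not be simple vectors of one kind.
df$when <- as.Date("2009-02-01") + 0:2
df$grp  <- factor(c("u", "v", "u"))
col_classes <- sapply(df, function(col) class(col)[1])
print(col_classes)
```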

A historical note - unique row names on data frames were a design decision made when people worked with small data frames, and are convenient for small data frames. But they are a problem for large data frames. I was writing for all users, not just those with small data frames and meaningful names.

I like R's 'automatic' row names. This is a big help working with huge data frames (and I do this often, at Google). But this doesn't go far enough; subscripting and other operations sometimes convert the automatic names to real names, and check/enforce uniqueness, which is a big waste of time when working with large data frames. I'll comment more on this in a new thread.
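
One way to watch automatic row names turn into real ones is the internal helper `.row_names_info()`, which by default reports a negative count when the row names are automatic. The sketch below is an illustration with made-up data, not a measurement of the cost described above, and the helper's behavior is assumed to be as documented in current R:

```r
# Hypothetical data; row names are automatic (1..5) on creation.
df <- data.frame(a = 1:5)
info_full <- .row_names_info(df)   # negative: automatic row names
print(info_full)

# After row subscripting the surviving rows keep their old labels,
# so the row names are no longer the automatic 1..n sequence.
sub <- df[c(1, 3, 5), , drop = FALSE]
info_sub <- .row_names_info(sub)   # positive: stored, non-automatic
print(info_sub)
print(rownames(sub))
```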

Tim Hesterberg



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sun 01 Feb 2009 - 17:29:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 02 Feb 2009 - 12:30:25 GMT.
