Re: [Rd] data frame subset patch, take 2

From: Vladimir Dergachev <vdergachev_at_rcgardis.com>
Date: Wed 13 Dec 2006 - 18:03:21 GMT

On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote:
>
> - Vladimir, have you verified your 'take2' against recent versions
> of R-devel?

Yes.

>
> - If they still work, could you re-post them to R-devel, this
> time using a proper MIME type,
> i.e. most probably one of
> application/x-tar
> application/x-compressed-tar
> application/x-gzip
>
> In case you don't know how to achieve this,
> I'd be interested to get it by "private" e-mail.

No problem. The old e-mail did have a mime type: "text/x-diff". I am resending the patch - now compressed, hopefully it will get pass whatever filters are in place.

With regard to speedups in R, here is my wish list - I would greatly appreciate comments on what makes sense here or not, etc:

  1. I greatly miss equivalents of Tcl append and lappend commands - not the function performed by these commands but efficiency (they are O(1) on average). Tcl easily handles lists with 1e6 components and strings of 10s of megabytes in length.
  2. It would be nice to have true hashed arrays in R (i.e. O(1) access times). So far I have used named lists for this, but they are O(n):

> L<-list(); system.time(for(i in 1:10000)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000
> L<-list(); system.time(for(i in 1:20000)L[[paste(i)]]<-i);
[1] 11.789 0.216 12.004 0.000 0.000

    3. Efficient manipulation of large numbers of strings. The big reason character row.names are slow is because they require a large number of string objects that slow down garbage collector.

    This is possibly not a problem that has an easy solution, here are a couple of approaches I have considered:

  1. Inline strings - use a structure like union { struct { unsigned char size; char body[15]; } inlined_string; /* use this when size<16 */
			struct {
				unsigned char flag;  
				char reserved[7]; /* for 64 bit */
				CHRSXP ptr;
				} indirect_string; /* use this when flag=255 */
			}

             This basically turns small strings into an enum type stored 
within a 128-bit integer. This would greatly decrease required number of CHRSXP in many common cases (in particular for many rownames).

             The biggest disadvantage is more complicated access to string data. Also this does not solve the issue of how to deal with 1e6 long strings - though I feel like 15 characters should be good enough for most uses.

        b) CHRSXPs are always leaf nodes. One could implement true reference counting and create a separate garbage collector pool for them. This way one can rely on reference counting to free string objects during normal operation, but also keep track of the number of referenced strings during garbage collector passes - and trigger string garbage collection passes (with a warning) when the number of referenced strings is much smaller the number of objects in string pool.

            This gets rid of overhead that strings impose on garbage collector. The disadvantage are very large changes to R code.

                            best

                                Vladimir Dergachev

______________________________________________

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu Dec 14 17:24:07 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sun 17 Dec 2006 - 03:30:55 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.