Re: [Rd] Some R questions

From: miguel manese <jjonphl_at_gmail.com>
Date: Wed 01 Nov 2006 - 02:30:55 GMT

Hi,

I ran into this while working on SQLiteDF...

On 11/1/06, Vladimir Dergachev <vdergachev@rcgardis.com> wrote:
> Hi all,
>
> I am working with some large data sets (1-4 GB) and have some questions
> that I hope someone can help me with:
>
> 1. Is there a way to turn off garbage collector from within C interface ?
> what I am trying to do is suck data from mysql (using my own C
> functions) and I see that allocating each column (with about 1-4 million
> items) takes between 0.5 and 1 seconds. My first thought was that it
> would be nice to turn off garbage collector, allocate all the data,
> copy values and then turn the garbage collector back on.
I believe not. FWIW, a numeric() vector is a chunk of memory with a VECTOR_SEXPREC header followed by your data, allocated contiguously. If you are desperate enough, and assuming the garbage collector is indeed the culprit, you may want to implement your own lightweight allocVector (the function that NEW_NUMERIC() etc. expand to).
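
To picture the "header followed by contiguous data" layout, here is a toy model in plain C. It is only a sketch: `toy_vec` and `toy_alloc_real` are hypothetical names, and R's real header (VECTOR_SEXPREC in Rinternals.h) carries more fields than shown here. Memory obtained this way is invisible to R's collector, so it cannot be handed to R as a SEXP as-is.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a vector header; NOT R's real VECTOR_SEXPREC. */
typedef struct {
    int    type;     /* hypothetical type tag */
    size_t length;   /* number of elements that follow */
    double data[];   /* payload stored contiguously after the header */
} toy_vec;

/* One malloc covers header and payload; returns NULL on failure. */
static toy_vec *toy_alloc_real(size_t n)
{
    /* Guard against overflow in the size computation below. */
    if (n > (SIZE_MAX - sizeof(toy_vec)) / sizeof(double))
        return NULL;

    toy_vec *v = malloc(sizeof(toy_vec) + n * sizeof(double));
    if (v == NULL)
        return NULL;

    v->type = 14;    /* 14 is the type code of REALSXP in R */
    v->length = n;
    return v;
}
```

The caller writes into `v->data[0..n-1]` and releases the whole thing with a single `free(v)`.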

> 2. For creating STRSXP should I be using mkChar() or mkString() to create
> element values ? Is there a way to do it without allocating a cons cell ?
> (otherwise a single STRSXP with 1e6 length slows down garbage collector)
A string vector (STRSXP) is composed of CHARSXPs. mkChar makes a CHARSXP, and mkString makes a STRSXP holding one CHARSXP, so it is more or less shorthand for

SEXP str = PROTECT(NEW_CHARACTER(1));  /* protect: mkChar() allocates and may trigger GC */
SET_STRING_ELT(str, 0, mkChar("foo"));
UNPROTECT(1);
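
For a long vector the same pattern goes in a loop, with mkChar called once per element. A minimal sketch, assuming the values are already available as C strings; `strings_to_strsxp` is a hypothetical helper name, and the fragment only builds against R's headers (e.g. with R CMD SHLIB):

```c
#include <R.h>
#include <Rinternals.h>

/* Build a character vector from n NUL-terminated C strings. */
SEXP strings_to_strsxp(const char **src, int n)
{
    SEXP out = PROTECT(allocVector(STRSXP, n));
    for (int i = 0; i < n; i++) {
        /* mkChar() allocates a CHARSXP. 'out' is protected, and the new
           CHARSXP becomes reachable from it as soon as it is stored. */
        SET_STRING_ELT(out, i, mkChar(src[i]));
    }
    UNPROTECT(1);
    return out;
}
```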

> 3. Is "row.names" attribute required for data frames and, if so, can I
> use some other type besides STRSXP ?
It is required. As of R 2.4.0 it can be an integer vector instead of a character vector.
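
From C, integer row names could be attached roughly as follows. This is a sketch, not a complete data frame constructor: `set_integer_row_names` is a hypothetical helper, `df` is assumed to be a list that the caller already protects, and the "names" and "class" attributes still have to be set separately.

```c
#include <R.h>
#include <Rinternals.h>

/* Attach row names 1..n, stored as an integer vector, to 'df'. */
void set_integer_row_names(SEXP df, int n)
{
    SEXP rn = PROTECT(allocVector(INTSXP, n));
    for (int i = 0; i < n; i++)
        INTEGER(rn)[i] = i + 1;          /* row names are 1-based */
    setAttrib(df, R_RowNamesSymbol, rn);
    UNPROTECT(1);
}
```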

> 4. While poking around to find out why some of my code is excessively slow
> I have come upon definition of `[.data.frame` - subscription operator
> for data frames, which appears to be written in R. I am wondering whether
> I am looking at the right place and whether anyone would be interested in
> a piece of C code optimizing it - in particular extraction of single element
> is quite slow (i.e. calls like T[i, j]).
[.data.frame is such a pain to implement because there are just too many ways to index a data frame. You may want to write a specialized indexer that handles only the indexing styles you actually use. But I think you are just not vectorizing enough. If you have to access your data frames like that, then it must be inside some loop, which would kill your social life.
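
As an illustration of such a specialized indexer, the sketch below handles exactly one case: a single element T[i, j] with scalar 1-based indices and a numeric column. `df_get_real` is a hypothetical name, not part of R, and the function would be called from R with .Call(). It relies on a data frame being a list of column vectors.

```c
#include <R.h>
#include <Rinternals.h>

/* Fast path for T[i, j]: scalar i and j, numeric (REALSXP) column only. */
SEXP df_get_real(SEXP df, SEXP si, SEXP sj)
{
    int i = asInteger(si);
    int j = asInteger(sj);

    if (TYPEOF(df) != VECSXP || j == NA_INTEGER || j < 1 || j > LENGTH(df))
        error("column index out of range");

    SEXP col = VECTOR_ELT(df, j - 1);    /* columns are list elements */
    if (TYPEOF(col) != REALSXP)
        error("only numeric columns are handled");
    if (i == NA_INTEGER || i < 1 || i > LENGTH(col))
        error("row index out of range");

    return ScalarReal(REAL(col)[i - 1]);
}
```

Factor, character, and integer columns, negative or logical indices, and row-name lookups are all deliberately left out. Covering them is what makes the general operator expensive.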

<pimp-my-project>
Or, you may just use (and pour your effort into improving) SQLiteDF
http://cran.r-project.org/src/contrib/Descriptions/SQLiteDF.html
</pimp-my-project>

M. Manese



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed Nov 01 13:35:17 2006
