Re: [Rd] Some R questions

From: Vladimir Dergachev <vdergachev_at_rcgardis.com>
Date: Wed 01 Nov 2006 - 18:40:39 GMT

On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> Hi,
>
> Had experience with this on doing SQLiteDF...
>
> On 11/1/06, Vladimir Dergachev <vdergachev@rcgardis.com> wrote:
> > Hi all,
> >
> > I am working with some large data sets (1-4 GB) and have some
> > questions that I hope someone can help me with:
> >
> > 1. Is there a way to turn off garbage collector from within C
> > interface ? what I am trying to do is suck data from mysql (using my own
> > C functions) and I see that allocating each column (with about 1-4
> > million items) takes between 0.5 and 1 seconds. My first thought was that
> > it would be nice to turn off garbage collector, allocate all the data,
> > copy values and then turn the garbage collector back on.
>
> I believe not. FWIW a numeric() vector is a chunk of memory with a
> VECTOR_SEXP header and then your data contiguously allocated. If you
> are desperate enough and assuming the garbage collector is indeed the
> culprit, you may want to implement your own lightweight allocVector
> (the function expanded to by NEW_NUMERIC(), etc.)

Thank you very much for the suggestion ! After looking around in the code I realized that what I really wanted was R_gc_internal() - that way I can tell the garbage collector in advance how much heap I will require, so that it does not need to go and grow it each time I ask (btw, I would have expected it to double the heap each time it runs out, but this is not what goes on, at least in R 2.3.1).

After some mucking around, here is a poor man's substitute which might be useful:

void fault_mem_region(long size)
{
	long chunk;
	long max = (1 << 30) / sizeof(int);	/* at most 1 GB of ints per block */
	int block_count = 0;
	SEXP block;

	while (size > 0) {
		chunk = size;
		if (chunk > max)
			chunk = max;
		/* force the R heap to grow now, in one large step */
		PROTECT(block = allocVector(INTSXP, chunk));
		block_count++;
		size -= chunk;
	}

	/* drop the scratch vectors; the grown heap remains available */
	UNPROTECT(block_count);
}

On a 48 column data frame (with 1.2e6 rows) the call fault_mem_region(ncol+nrow*11+ncol*nrow) shaved 5 seconds off a 33 second running time (which includes running the mysql query).

It is not perfect, however, as I could see the last columns allocating more slowly than the initial ones.

Also, while looking around in allocVector I saw that after running the garbage collector it simply calls malloc, and if malloc fails it calls the garbage collector again.

What would be nice is the ability to bypass that first garbage collector call when allocating large nodes.

>
> > 2. For creating STRSXP should I be using mkChar() or mkString() to
> > create element values ? Is there a way to do it without allocating a cons
> > cell ? (otherwise a single STRSXP with 1e6 length slows down garbage
> > collector)
>
> A string vector (STRSXP) is composed of CHARSXP's. mkChar makes a
> CHARSXP, and mkString makes a STRSXP with 1 CHARSXP, more like a
> shorthand for
>
> SEXP str = NEW_CHARACTER(1);
> SET_STRING_ELT(str, 0, mkChar("foo"));

Makes sense - thank you !
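For anyone following along, the pattern for filling a large character column one element at a time looks roughly like this - a sketch against R's C API (needs Rinternals.h; values and n are hypothetical placeholders for the strings coming out of mysql):

SEXP make_string_column(char **values, int n)
{
	SEXP col;
	int i;

	PROTECT(col = allocVector(STRSXP, n));
	for (i = 0; i < n; i++) {
		/* mkChar allocates one CHARSXP per element; */
		/* SET_STRING_ELT stores it in slot i of the STRSXP */
		SET_STRING_ELT(col, i, mkChar(values[i]));
	}
	UNPROTECT(1);
	return col;
}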

>
> > 3. Is "row.names" attribute required for data frames and, if so, can
> > I use some other type besides STRSXP ?
>
> It is required. It can be integers, for 2.4.0+
>

Great, that helps !
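If integer row.names are accepted in 2.4.0+, then from C something like the following should do - a sketch, assuming df is an already-assembled VECSXP of columns and n is the row count (both placeholders):

	SEXP rn;
	int i;

	PROTECT(rn = allocVector(INTSXP, n));
	for (i = 0; i < n; i++)
		INTEGER(rn)[i] = i + 1;	/* 1-based row numbers */
	/* attach as the row.names attribute via the predefined symbol */
	setAttrib(df, R_RowNamesSymbol, rn);
	UNPROTECT(1);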

> > 4. While poking around to find out why some of my code is
> > excessively slow I have come upon definition of `[.data.frame` -
> > subscription operator for data frames, which appears to be written in R.
> > I am wondering whether I am looking at the right place and whether anyone
> > would be interested in a piece of C code optimizing it - in particular
> > extraction of single element is quite slow (i.e. calls like T[i, j]).
>
> [.data.frame is such a pain to implement because there are just too
> many ways to index a data frame. You may want to do a specialized
> index-er that just considers the index-ing styles you use. But I think
> you are just not vectorizing enough. If you have to access your data
> frames like that then it must be inside some loop, which would kill
> your social life.

Hmm, my thought was to implement subscripting with integer or logical vectors, and then some hash-based lookup for column and (possibly) row names.

The slowness manifests itself for vectorized code as well. I believe it is due to the code mucking about with the row.names attribute, which imposes a penalty on any [,] operation - a penalty that grows linearly with the number of rows.

Thus, for large data frames, A[,1] is slower than A[[1]]. For the data frame I mentioned above, E<-A[[1]] took 0.46 seconds (way too much in my opinion), but E<-A[,1] took 62.45 seconds - more than a minute, and more than twice the time it took to load the entire thing into memory. Silly, isn't it ?

Also, there are good reasons to want to address individual cells. And there is no reason why such access cannot be constant time.
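For instance, from C a single cell of a numeric column can be read in constant time, bypassing [.data.frame entirely - a sketch, assuming column j is known to be a REALSXP and i, j are 0-based indices (all placeholders):

double get_cell(SEXP df, int i, int j)
{
	/* VECTOR_ELT fetches column j of the data frame (a VECSXP); */
	/* REAL() gives the column's double array, indexed directly at i */
	return REAL(VECTOR_ELT(df, j))[i];
}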

>
> <pimp-my-project>
> Or, you may just use (and pour your effort on improving) SQLiteDF
> http://cran.r-project.org/src/contrib/Descriptions/SQLiteDF.html
> </pimp-my-project>

Very nice ! The documentation mentioned something about the assignment operator not working - is this still true ? Or maybe I misunderstood something ?

Also, I wonder whether it would be possible to extend the [[ operator so one can run queries: SQLDF[["SELECT * FROM a WHERE.."]]

                           thank you very much !

                                     Vladimir Dergachev

>
> M. Manese



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Thu Nov 02 15:31:32 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 02 Nov 2006 - 20:30:32 GMT.
