Re: [Rd] allocVector bug ?

From: Vladimir Dergachev <vdergachev_at_rcgardis.com>
Date: Mon 06 Nov 2006 - 22:33:20 GMT

Hi Luke,

   Thank you for the patient reply !

   I have looked into the issue a little deeper, comments below:

On Thursday 02 November 2006 11:26 pm, Luke Tierney wrote:
> On Wed, 1 Nov 2006, Vladimir Dergachev wrote:
> > Hi all,
> >
> > I was looking at the following piece of code in src/main/memory.c,
> > function allocVector :
> >
> > if (size <= NodeClassSize[1]) {
> > node_class = 1;
> > alloc_size = NodeClassSize[1];
> > }
> > else {
> > node_class = LARGE_NODE_CLASS;
> > alloc_size = size;
> > for (i = 2; i < NUM_SMALL_NODE_CLASSES; i++) {
> > if (size <= NodeClassSize[i]) {
> > node_class = i;
> > alloc_size = NodeClassSize[i];
> > break;
> > }
> > }
> > }
> >
> >
> > It appears that for LARGE_NODE_CLASS the variable alloc_size should not
> > be size, but something far less as we are not using vector heap, but
> > rather calling malloc directly in the code below (and from discussions I
> > read on this mailing list I think that these two are different - please
> > let me know if I am wrong).
> >
> > So when allocate a large vector the garbage collector goes nuts trying to
> > find all that space which is not going to be needed after all.

>

> This is as intended, not a bug. The garbage collector does not "go
> nuts" -- it is doing a garbage collection that may release memory in
> advance of making a large allocation. The size of the current
> allocation request is used as part of the process of deciding when to
> satisfy an allocation by malloc (of a single large noda or a page) and
> when to first do a gc. It is essential to do this for large
> allocations as well to keep the memory footprint down and help reduce
> fragmentation.

I generally agree with this, however I believe that current logic breaks down for large allocation sizes and my code ends up spending 70% (and up) of computer time spinning inside garbage collector (I run oprofile to observe what is going on).

I do realize that garbage collection is not an easy problem and that hardware and software environments change - my desire is simply to have a version of R that is usable for the problems I am dealing with as, aside from slowdown with large vector sizes, I find R a very capable tool.

I would greatly appreciate if you could comment on the following observations:

  1. The time spent during single garbage collector run grows with the number of nodes - from looking at the code I believe it is linear, but I am not certain.
  2. In my case the data.frame contains a few string vectors. These allocate lots of CHARSXPs which are the main cause of slowdown of each garbage collector run. Would you have any suggestions on optimizing this particular situation ?
  3. Any time a data.frame is created, or one performs an attach() operation there is a series of allocations - and if one of them causes memory to expand all the rest will too.

  I put in a fprintf() statement to show alloc_size, VHEAP_FREE and RV_size when allocVector is called (this is done only for node_class == LARGE_NODE_CLASS).   First output snippet is from the time the script starts and tries to create data.frame:

alloc_size=128 VHEAP_FREE=604182 R_VSize=786432
alloc_size=88 VHEAP_FREE=660051 R_VSize=786432
alloc_size=88 VHEAP_FREE=659963 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=659874 R_VSize=786432
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=4465461
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=8544282
alloc_size=4078820 VHEAP_FREE=260678 R_VSize=12623103
...
alloc_size=4078820 VHEAP_FREE=260677 R_VSize=271628325 alloc_size=4078820 VHEAP_FREE=260677 R_VSize=275707147

As you can see the VHEAP_FREE()<alloc_size logic gets triggered every single time and we run the garbage collector even though the memory footprint is clearly growing. The fact that in my case the first few columns are strings with lots of different values makes this particularly bad.

Next, (as a test) I run attach(B) which produced the following output:
> attach(B)

alloc_size=4078820 VHEAP_FREE=1274112 R_VSize=294022636 alloc_size=4078820 VHEAP_FREE=499351 R_VSize=297325768 ...

alloc_size=4078820 VHEAP_FREE=602082 R_VSize=568670030
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=572748850
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=576827670
alloc_size=88 VHEAP_FREE=602082 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601915 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601798 R_VSize=580906490
alloc_size=88 VHEAP_FREE=601678 R_VSize=580906490
...
alloc_size=44 VHEAP_FREE=591581 R_VSize=580906490
alloc_size=88 VHEAP_FREE=591323 R_VSize=580906490
alloc_size=44 VHEAP_FREE=591220 R_VSize=580906490

So we have the same behaviour as before - the garbage collector gets run every time attach creates a new large vector, but functions perfectly for smaller vector sizes.

Next, I did detach(B) (which freed up memory) followed by "F<-B[,1]":

alloc_size=113 VHEAP_FREE=588448 R_VSize=580906490
alloc_size=618 VHEAP_FREE=588335 R_VSize=580906490
alloc_size=618 VHEAP_FREE=587717 R_VSize=580906490
alloc_size=128 VHEAP_FREE=587099 R_VSize=580906490
alloc_size=88 VHEAP_FREE=586825 R_VSize=580906490
alloc_size=4078820 VHEAP_FREE=586737 R_VSize=580906490
alloc_size=4078820 VHEAP_FREE=284079854 R_VSize=580906490
alloc_size=4078820 VHEAP_FREE=280001034 R_VSize=580906490
alloc_size=4078820 VHEAP_FREE=275922214 R_VSize=580906490
alloc_size=4078820 VHEAP_FREE=271843394 R_VSize=580906490
...
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=843990775
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=848069595
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=852148415
alloc_size=4078820 VHEAP_FREE=602082 R_VSize=856227235
alloc_size=88 VHEAP_FREE=602082 R_VSize=860306055

In this case the logic worked - it freed memory used up by attach() but stopped working after that memory was consumed. (Why subscription operator tries to allocate an entire copy of the data.frame is a completely separate question - I am looking into it, but with slow progress).

My latest attempt to address this is the following modification to allocVector:

R_size_t vheap_grown=0;
R_size_t vheap_reused=0;
int count_skip=0;
int max_skip=0;

SEXP allocVector(SEXPTYPE type, R_len_t length) {
...
...

    /* we need to do the gc here so allocSExp doesn't! */     if (FORCE_GC || NO_FREE_NODES() || ((VHEAP_FREE() < alloc_size) && ! count_skip)) {

	fprintf(stderr, "Running gc\n");
	R_gc_internal(alloc_size);
	if (NO_FREE_NODES())
	    mem_err_cons();
	if (VHEAP_FREE() < alloc_size)
	    mem_err_heap(size);

	/* we use 2*alloc_size as if we run out of memory the
          gc will grow R_VSize to alloc for alloc_size allocation */
        if(VHEAP_FREE() < 2*alloc_size) {
	    vheap_grown+=alloc_size;
	    if(vheap_grown>vheap_reused) {
		max_skip=2*max_skip+1;
		count_skip=max_skip;
		}
	    } else {
	    vheap_reused+=alloc_size;
	    if(vheap_reused>2*vheap_grown) {
		   vheap_reused=0;
		   vheap_grown=0;
		   max_skip=0;
		   }
	    }

    }

    /* Adjust R_VSize if we skipped garbage collection */     if( (VHEAP_FREE() < alloc_size) && count_skip) {

	 R_VSize+=alloc_size;
	 count_skip --;
	 }

...
...
}

This code attempts to detect situations when we are growing memory footprint and skip garbage collection before making a check again.

                              thank you very much !

                                            Vladimir Dergachev

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue Nov 07 16:01:00 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 09 Nov 2006 - 02:30:43 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.