Re: [Rd] allocVector bug ?

From: Vladimir Dergachev <vdergachev_at_rcgardis.com>
Date: Wed 08 Nov 2006 - 21:33:17 GMT

On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
> On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
> > Hi Luke,
> >
> >
> > I generally agree with this, however I believe that current logic breaks
> > down for large allocation sizes and my code ends up spending 70% (and up)
> > of computer time spinning inside garbage collector (I run oprofile to
> > observe what is going on).
>
> Again please be careful about these sorts of statements. I am sure
> there are bugs in the memory manager and places where things "break
> down" but this isn't one of them. The memory manager is quite
> deliberately biased towards keeping the total allocation low, if
> necessary at the expense of some extra gc overhead. This is needed if
> we want to use the same settings across a wide range of
> configurations, some of which have relatively little memory available
> (think student labs). The memory manager does try to learn about the
> needs of a session, and as a result the triggering values get adjusted. It
> is not true that every large allocation causes a gc. This may be true
> _initially_, but once total memory usage stabilizes at a particular
> level it is no longer true (look at the way the heap limits are
> adjusted).
>
> This approach of adjusting based on usage within a session is
> reasonable and works well for longer sessions. It may not work well
> for short scripts that need large allocations. I doubt that any
> automated setting can work well in that situation while at the same
> time keeping memory usage in other settings low. So it may be useful
> to find ways of specifying a collection strategy appropriate for these
> situations. If you can send me a simplified version of your usage
> scenario then I will give this some thought and see if we can come up
> with some reasonable ways of allowing user code to tweak gc behavior
> for these situations.
>

Hi Luke,    

   Yes, I gladly concede the point that for a heuristic algorithm the notion of what is a "bug" is murky (aside from crashes, etc., which is not what I am talking about).

   Here is why I called this a bug:

  1. My understanding is that each time gc() needs to increase memory it performs a full garbage collection run. Right?
  2. This is not a problem with small memory sizes, as they (presumably) imply a small number of objects.
  3. However, if one wants to allocate many objects (say, columns in a data frame or just vectors), this results in a large penalty.

Example 1: This simulates allocation of a data.frame with some character columns which are assumed to be factors. On my system the first assignment is nearly instantaneous, while subsequent assignments take slightly less than 0.1 seconds each.

L <- list()
Chars <- as.character(1:100000)
# Time 100 successive assignments of 1:1000000
for (i in 1:100) L[[i]] <- system.time(assign(paste("test", i), 1:1000000))
Times <- do.call(rbind, L)

Example 2: Same as example 1 but we first grow the memory with fake allocation:

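L <- list()
Chars <- as.character(1:100000)
# Grow the heap first with a large throwaway allocation
Data <- 1:100000000
rm(Data)
# Time 100 successive assignments of 1:1000000
for (i in 1:100) L[[i]] <- system.time(assign(paste("test", i), 1:1000000))
Times <- do.call(rbind, L)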

In this case the first 20 or so allocations are very quick (faster than 0.02 sec each), and then the garbage collector kicks in and the time rises to 0.08 seconds each - still less than in Example 1.

This example is relevant because this sequence of allocations is exactly what happens when one uses read.table or scan (or a database query) to load data.

What is more, if the user then manipulates the loaded data by creating columns that are a combination of existing ones, this is very slow as well.

I looked more carefully at your code in src/main/memory.c, function AdjustHeapSize:

    R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
        R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
        if (R_MaxVSize - R_VSize >= change)
            R_VSize += change;
    }

Could it be that R_NSize should be R_VSize? This would explain why I see a problem when R_VSize >> R_NSize.

                            thank you very much !

                                     Vladimir Dergachev

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu Nov 09 17:38:36 2006
