[Rd] Re: [R] Do environments make copies?

From: Nawaaz Ahmed <nawaaz_at_inktomi.com>
Date: Sun 27 Feb 2005 - 01:23:12 EST

Hi Folks,
Thanks for all your replies and input. In particular, thanks Luke, for explaining what is happening under the covers. In retrospect, my example using save and load to demonstrate the problem I was having was a mistake - I was trying to reproduce the problem in a simple enough way, and I thought save and load were showing the same problem (i.e. an extra copy was being made). After carefully examining my gc() traces, I've come to realize that while there are copies being made, there is nothing unexpected about it - the failure to allocate memory is really because R is hitting the 3GB address limit imposed by my Linux box during processing. So, as Luke suggests, maybe 32 bits is not the right platform for handling large data in R.

On the other hand, I think the problem could be somewhat alleviated (though not eliminated) if temporary variables were garbage collected immediately, which would reduce the memory footprint and the fragmentation problem that malloc() is going to be faced with (gctorture() is probably too extreme :-). Most of the problems that I am having are in the coercion routines, which do create temporary copies. So in code of the form x = as.vector(x), it would be nice if the old value of x was garbage collected (i.e. if there were no references to it).
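
A minimal sketch of the pattern described above, assuming a small integer vector as a stand-in for the real data. After the coercion, the old value of x has no remaining references, and an explicit gc() call asks R to collect it now rather than at the next automatic collection. Whether this helps with fragmentation on a given platform is not something this sketch demonstrates.

```r
# Hypothetical stand-in for a large vector (the real data in the thread is far bigger).
x <- rep(1L, 1e6)

# Coercion allocates a new double vector; rebinding x leaves the old
# integer vector without any references.
x <- as.vector(x, mode = "double")

# Request a garbage collection now instead of waiting for the next
# automatic one. gc() returns a memory usage summary, hidden here.
invisible(gc())
```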

nawaaz

Luke Tierney wrote:
> On Thu, 24 Feb 2005, Berton Gunter wrote:
>

>> I was hoping that one of the R gurus would reply to this, but as they
>> haven't (thus far) I'll try. Caveat emptor!
>>
>> First of all, R passes function arguments by value, so as soon as you
>> call foo(val) you are already making (at least) one other copy of val
>> for the call.

>
>
> Conceptually you have a copy, but internally R tries to use a
> copy-on-modify strategy to avoid copying unless necessary. There are
> conservative approximations involved, so there is more copying than
> one might like but definitely not as much as this.
>
>
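
The visible effect of copy-on-modify can be sketched as follows. The function names read_only and modify are hypothetical and chosen only for illustration. The sketch shows the semantics (the caller's vector is never changed); it does not measure when R actually duplicates memory.

```r
# A function that only reads its argument: no private copy is needed.
read_only <- function(v) sum(v)

# A function that modifies its argument: the change applies to a private
# copy inside the call, so the caller's vector is untouched.
modify <- function(v) {
    v[1] <- -1
    v
}

val <- c(1, 2, 3)
total <- read_only(val)
changed <- modify(val)

# val still holds its original contents; changed holds the modified copy.
```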
>> Second, you seem to implicitly make the assumption that
>> assign(..., env=) uses a pointer to point to the values in the
>> environment. I do not know how R handles environments and assignments
>> like this internally, but your data seems to indicate that it copies
>> the value and does not merely point to it (this is where R Core folks
>> can shed more authoritative light).

>
>
> This assignment does just store the pointer.
>
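
A small sketch of the reference behaviour of environments, assuming a modest numeric vector in place of the large object from the thread. The names big, e and alias are hypothetical. It shows that two handles on one environment see the same bindings; that assign() stores only a pointer is Luke's statement, not something this code proves.

```r
# Hypothetical data, much smaller than the 500MB vector in the thread.
big <- runif(1e5)

e <- new.env()
assign("value", big, envir = e)   # bind the name "value" in e

alias <- e                        # a second handle on the same environment
assign("flag", TRUE, envir = alias)

# The binding made through alias is visible through e, because both
# names refer to one environment rather than to two copies.
flag_seen <- exists("flag", envir = e, inherits = FALSE)
same_value <- identical(get("value", envir = e), big)
```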
>> Finally, it makes perfect sense to me that, as a data structure, the
>> environment itself may be small even if it effectively points to (one of
>> several copies of) large objects, so that object.size(an.environment)
>> could be small although the environment may "contain" huge arguments.
>> Again, the details depend on the precise implementation and need
>> clarification by someone who actually knows what's going on here, which
>> ain't me.
>>
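
The point about object.size() can be checked with a short sketch, again using a hypothetical small vector. object.size() reports the size of the environment object itself and does not include the objects bound inside it, which is consistent with the small sizes shown in the original transcript.

```r
big <- numeric(1e5)               # hypothetical stand-in for a large object
e <- new.env()
assign("value", big, envir = e)

size_env <- as.numeric(object.size(e))    # size reported for the environment itself
size_big <- as.numeric(object.size(big))  # size reported for the vector

# The environment is reported as far smaller than the vector it holds.
env_is_smaller <- size_env < size_big
```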
>> I think the important message is that you shouldn't treat R as C, and you
>> shouldn't try to circumvent R's internal data structures and
>> conventions. R is a language designed to implement Chambers's S model of
>> "Programming with Data." Instead of trying to fool R to handle large
>> data sets, maybe you should consider whether you really **need** all the
>> data in R at one time and if sensible partitioning or sampling to analyze
>> only a portion or portions of the data might not be a more effective
>> strategy.

>
>
> R can do quite a reasonable job with large data sets on a reasonable
> platform. A 32-bit platform is not a reasonable one on which to use R
> with 800 MB chunks of data. Automatic memory management combined with
> the immutable vector semantics require more elbow room than that. If
> you really must use data of this size on a 32-bit platform you will
> probably be much happier using a limited amount of C code and external
> pointers.
>
> As to what is happening in this example: look at the default parent
> used by new.env and combine that with the fact that the serialization
> code does not preserve sharing of atomic objects. The two references
> to the large object are shared in the original session but lead to two
> large objects in the saved image and the load. Using
>
> ref <- list(env = new.env(parent = .GlobalEnv))
>
> in new.ref avoids the second copy both in the saved image and after
> loading.
>
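
A sketch of why the default parent matters, under the assumption that the constructor is written as in the original post. The name old.ref is hypothetical and stands for the original definition; new.ref uses the change Luke suggests. With the default, the environment's parent is the call frame of the constructor, which still holds its own binding for value, so saving the reference drags that frame along.

```r
# Original form: new.env() defaults to the calling frame as parent.
old.ref <- function(value = NULL) {
    ref <- list(env = new.env())
    class(ref) <- "refObject"
    assign("value", value, envir = ref$env)
    ref
}

# Suggested form: the parent is the global environment instead.
new.ref <- function(value = NULL) {
    ref <- list(env = new.env(parent = .GlobalEnv))
    class(ref) <- "refObject"
    assign("value", value, envir = ref$env)
    ref
}

y <- numeric(1e4)   # hypothetical small stand-in for the large vector
o <- old.ref(y)
r <- new.ref(y)

# The old form's parent frame holds a second binding named "value".
frame_has_value <- exists("value", envir = parent.env(o$env), inherits = FALSE)
# The new form's parent is the global environment.
parent_is_global <- identical(parent.env(r$env), globalenv())
```

Comparing length(serialize(o, NULL)) with length(serialize(r, NULL)) is one way to look at the effect on the saved image; the exact sizes depend on the R version, so none are quoted here.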
> luke
>
>>
>>> -----Original Message-----
>>> From: r-help-bounces@stat.math.ethz.ch
>>> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Nawaaz Ahmed
>>> Sent: Thursday, February 24, 2005 10:36 AM
>>> To: r-help@stat.math.ethz.ch
>>> Subject: [R] Do environments make copies?
>>>
>>> I am using environments to avoid making copies (by keeping
>>> references). But it seems like there is a hidden copy going on
>>> somewhere - for example in the code fragment below, I am creating a
>>> reference to "y" (of size 500MB) and storing the reference in object
>>> "data". But when I save "data" and then restore it in another R
>>> session, gc() claims it is using twice the amount of memory.
>>> Where/How is this happening?
>>>
>>> Thanks for any help in working around this - my datasets are just not
>>> fitting into my 4GB, 32 bit linux machine (even though my actual data
>>> size is around 800MB)
>>>
>>> Nawaaz
>>>
>>> > new.ref <- function(value = NULL) {
>>> +     ref <- list(env = new.env())
>>> +     class(ref) <- "refObject"
>>> +     assign("value", value, env = ref$env)
>>> +     ref
>>> + }
>>> > object.size(y)
>>> [1] 587941404
>>> > y.ref = new.ref(y)
>>> > object.size(y.ref)
>>> [1] 328
>>> > data = list()
>>> > data$y.ref = y.ref
>>> > object.size(data)
>>> [1] 492
>>> > save(data, file = "data.RData")
>>>
>>> ...
>>>
>>> run R again
>>> ===========
>>>
>>> > load("data.RData")
>>> > gc()
>>>              used   (Mb) gc trigger   (Mb)
>>> Ncells    141051    3.8     350000    9.4
>>> Vcells 147037925 1121.9  147390241 1124.5
>>>
>>> ______________________________________________
>>> R-help@stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide!
>>> http://www.R-project.org/posting-guide.html
>>>
>>
>

______________________________________________
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sun Feb 27 01:25:23 2005

This archive was generated by hypermail 2.1.8 : Fri 18 Mar 2005 - 09:02:59 EST