Re: [Rd] Re: [R] Do environments make copies?

From: Gabor Grothendieck <ggrothendieck_at_myway.com>
Date: Sun 27 Feb 2005 - 01:28:46 EST

See ?gctorture
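Gabor's pointer can be sketched like this (a minimal, hypothetical illustration with toy data; see ?gctorture for the real details):

```r
# gctorture(TRUE) forces a garbage collection at (almost) every
# allocation: premature-collection and copying bugs become reproducible,
# but everything runs drastically slower, so wrap only the suspect code.
gctorture(TRUE)
x <- as.vector(1:10)   # each allocation in here now triggers a GC
gctorture(FALSE)       # turn it back off immediately
```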

Nawaaz Ahmed <nawaaz <at> inktomi.com> writes:

:
: Hi Folks,
: Thanks for all your replies and input. In particular, thanks Luke, for
: explaining what is happening under the covers. In retrospect, using
: save and load to demonstrate my problem was a mistake - I was trying to
: reproduce it in a simple enough way, and I thought save and load were
: showing the same problem (i.e. an extra copy being made). After
: carefully examining my gc() traces, I've come to realize that while
: copies are being made, there is nothing unexpected about it - the
: failure to allocate memory is really because R is hitting the 3GB
: address limit imposed by my linux box during processing. So, as Luke
: suggests, maybe 32 bits is not the right platform for handling large
: data in R.
:
: On the other hand, I think the problem can be somewhat alleviated
: (though not eliminated) if we did garbage collection of temporary
: variables immediately, so that we can reduce the memory footprint and
: the fragmentation problem that malloc() is going to face (gctorture()
: is probably too extreme). Most of the problems I am having are in the
: coercion routines, which do create temporary copies. So in code of the
: form x = as.vector(x), it would be nice if the old value of x were
: garbage collected immediately (i.e. if there were no references to it).
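A hedged sketch of the workaround available today (toy sizes standing in for the poster's 800 MB data): after rebinding x, the old value is unreferenced, and an explicit gc() reclaims it immediately instead of waiting for the next automatic collection.

```r
x <- matrix(0, 1000, 1000)   # stand-in for a large object (~8 MB)
x <- as.vector(x)            # coercion copies; the old matrix is now unreferenced
invisible(gc())              # reclaim it right away, before the next big allocation
```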
:
: nawaaz
:
: Luke Tierney wrote:
: > On Thu, 24 Feb 2005, Berton Gunter wrote:
: >
: >> I was hoping that one of the R gurus would reply to this, but as they
: >> haven't
: >> (thus far) I'll try. Caveat emptor!
: >>
: >> First of all, R passes function arguments by values, so as soon as you
: >> call
: >> foo(val) you are already making (at least) one other copy of val for the
: >> call.
: >
: >
: > Conceptually you have a copy, but internally R tries to use a
: > copy-on-modify strategy to avoid copying unless necessary. There are
: > conservative approximations involved, so there is more copying than
: > one might like, but definitely not as much as this.
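Luke's copy-on-modify point can be observed directly with tracemem() (a sketch with toy data; it needs an R build with memory profiling enabled, which most binary distributions have):

```r
x <- c(1, 2, 3)
f <- function(v) sum(v)  # only reads its argument
tracemem(x)              # start reporting duplications of x
f(x)                     # no copy: the value is merely read
y <- x                   # still no copy: both names share one vector
y[1] <- 0                # copy-on-modify: a duplication is reported here
untracemem(x)
```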
: >
: >
: >> Second, you seem to implicitly make the assumption that assign(..., env=)
: >> uses a pointer to point to the values in the environment. I do not
: >> know how
: >> R handles environments and assignments like this internally, but your
: >> data
: >> seems to indicate that it copies the value and does not merely point
: >> to it
: >> (this is where R Core folks can shed more authoritative light).
: >
: >
: > This assignment does just store the pointer.
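That can be checked directly (a toy sketch): the environment stores only a reference, so its own size stays tiny regardless of what it binds.

```r
big <- numeric(1e6)                # ~8 MB of doubles
e <- new.env()
assign("value", big, envir = e)    # stores a pointer, no duplication
object.size(e)                     # tiny: just the environment header
identical(get("value", envir = e), big)
```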
: >
: >> Finally, it makes perfect sense to me that, as a data structure, the
: >> environment itself may be small even if it effectively points to (one of
: >> several copies of) large objects, so that object.size(an.environment)
: >> could
: >> be small although the environment may "contain" huge arguments. Again,
: >> the
: >> details depend on the precise implementation and need clarification by
: >> someone who actually knows what's going on here, which ain't me.
: >>
: >> I think the important message is that you shouldn't treat R as C, and you
: >> shouldn't try to circumvent R's internal data structures and
: >> conventions. R
: >> is a language designed to implement Chambers's S model of
: >> "Programming with
: >> Data." Instead of trying to fool R to handle large data sets, maybe you
: >> should consider whether you really **need** all the data in R at one time
: >> and if sensible partitioning or sampling to analyze only a portion or
: >> portions of the data might not be a more effective strategy.
: >
: >
: > R can do quite a reasonable job with large data sets on a reasonable
: > platform. A 32-bit platform is not a reasonable one on which to use R
: > with 800 MB chunks of data. Automatic memory management combined with
: > the immutable vector semantics requires more elbow room than that. If
: > you really must use data of this size on a 32-bit platform, you will
: > probably be much happier using a limited amount of C code and external
: > pointers.
: >
: > As to what is happening in this example: look at the default parent
: > used by new.env and combine that with the fact that the serialization
: > code does not preserve sharing of atomic objects. The two references
: > to the large object are shared in the original session but lead to two
: > large objects in the saved image and the load. Using
: >
: > ref <- list(env = new.env(parent = .GlobalEnv))
: >
: > in new.ref avoids the second copy both in the saved image and after
: > loading.
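Putting Luke's fix into the poster's constructor gives the following (a sketch, with a toy value in place of the 500 MB vector):

```r
new.ref <- function(value = NULL) {
    # Parenting on .GlobalEnv keeps the serialized image from capturing
    # (and, since sharing of atomic objects is not preserved on save,
    # duplicating) the objects in new.ref's own evaluation frame.
    ref <- list(env = new.env(parent = .GlobalEnv))
    class(ref) <- "refObject"
    assign("value", value, envir = ref$env)
    ref
}

r <- new.ref(1:5)
get("value", envir = r$env)
```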
: >
: > luke
: >
: >>
: >>> -----Original Message-----
: >>> From: r-help-bounces <at> stat.math.ethz.ch
: >>> [mailto:r-help-bounces <at> stat.math.ethz.ch] On Behalf Of Nawaaz Ahmed
: >>> Sent: Thursday, February 24, 2005 10:36 AM
: >>> To: r-help <at> stat.math.ethz.ch
: >>> Subject: [R] Do environments make copies?
: >>>
: >>> I am using environments to avoid making copies (by keeping
: >>> references).
: >>> But it seems like there is a hidden copy going on somewhere - for
: >>> example in the code fragment below, I am creating a reference to "y"
: >>> (of size 500MB) and storing the reference in object "data".
: >>> But when I
: >>> save "data" and then restore it in another R session, gc()
: >>> claims it is
: >>> using twice the amount of memory. Where/How is this happening?
: >>>
: >>> Thanks for any help in working around this - my datasets are just not
: >>> fitting into my 4GB, 32 bit linux machine (even though my actual data
: >>> size is around 800MB)
: >>>
: >>> Nawaaz
: >>>
: >>> > new.ref <- function(value = NULL) {
: >>> + ref <- list(env = new.env())
: >>> + class(ref) <- "refObject"
: >>> + assign("value", value, env = ref$env)
: >>> + ref
: >>> + }
: >>> > object.size(y)
: >>> [1] 587941404
: >>> > y.ref = new.ref(y)
: >>> > object.size(y.ref)
: >>> [1] 328
: >>> > data = list()
: >>> > data$y.ref = y.ref
: >>> > object.size(data)
: >>> [1] 492
: >>> > save(data, file = "data.RData")
: >>>
: >>> ...
: >>>
: >>> run R again
: >>> ===========
: >>>
: >>> > load("data.RData")
: >>> > gc()
: >>> used (Mb) gc trigger (Mb)
: >>> Ncells 141051 3.8 350000 9.4
: >>> Vcells 147037925 1121.9 147390241 1124.5
: >>>
: >>> ______________________________________________
: >>> R-help <at> stat.math.ethz.ch mailing list
: >>> https://stat.ethz.ch/mailman/listinfo/r-help
: >>> PLEASE do read the posting guide!
: >>> http://www.R-project.org/posting-guide.html
: >>>
: >>
: >>
: >
:
: ______________________________________________
: R-devel <at> stat.math.ethz.ch mailing list
: https://stat.ethz.ch/mailman/listinfo/r-devel
:
:



Received on Sun Feb 27 01:44:08 2005

This archive was generated by hypermail 2.1.8 : Sun 27 Feb 2005 - 02:42:40 EST