Re: [Rd] Large discrepancies in the same object being saved to .RData

From: Duncan Murdoch <murdoch.duncan_at_gmail.com>
Date: Sun, 11 Jul 2010 00:12:53 -0400

On 10/07/2010 10:10 PM, Bill.Venables_at_csiro.au wrote:
> Well, I have answered one of my questions below. The hidden
> environment is attached to the 'terms' component of v1.
>
> To see this
>
>
>> lapply(v1, environment)
>>
> $coefficients
> NULL
>
> $residuals
> NULL
>
> $effects
> NULL
>
> $rank
> NULL
>
> $fitted.values
> NULL
>
> $assign
> NULL
>
> $qr
> NULL
>
> $df.residual
> NULL
>
> $xlevels
> NULL
>
> $call
> NULL
>
> $terms
> <environment: 0x021b9e18>
>
> $model
> NULL
>
>
>> rm(junk, envir = with(v1, environment(terms)))
>> usedVcells()
>>
> [1] 96532
>
>>
>>
>
> This is still a bit of a trap for young (and old!) players...
>
> I think the main point in my mind is why is it that object.size()
> excludes enclosing environments in its reckonings?
>

I think the idea is that the environment is not part of the object, it is just referenced by the object. In fact, there are at least two references to the environment in your second example:

environment(v1$terms)

and

attr(v1$terms, ".Environment")

both refer to it. So you can't just add the size of an environment every time you come across it; you would need to keep track of whether it had already been counted. So as ?object.size says,

"Associated space (e.g. the environment of a function and what the pointer in a ‘EXTPTRSXP’ points to) is not included in the calculation."
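
As a rough sketch of the point above (not part of the original exchange), the following rebuilds Bill's second example with a smaller junk vector and checks that the two expressions reach the same environment, and that object.size() does not count what is in it. The names e1 and e2 are introduced here for illustration only.

```r
# Rebuild the second example from the thread, with a smaller 'junk'
# so that it runs quickly.
f1 <- function() {
  junk <- rnorm(1e5)
  x <- 1:3
  y <- rnorm(3)
  lm(y ~ x)
}
v1 <- f1()

# Two ways of reaching the environment captured by the formula.
e1 <- environment(v1$terms)
e2 <- attr(v1$terms, ".Environment")

identical(e1, e2)   # TRUE: one environment, referenced twice
ls(e1)              # the locals of f1(), including "junk"

# object.size() reports the lm object only, not the captured environment.
print(object.size(v1))
print(object.size(get("junk", envir = e1)))
```

Because both references point at a single environment, adding its size once per reference would count it twice.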
If you really want to know how much space an object will take when saved, probably the only reliable way is to save the object and look at how much space the file takes.
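
One way to do that, as a minimal sketch: write the object to a temporary file with save() and read the file size back. The helper name saved_size() is hypothetical, and the comparison assumes that resetting the formula's environment to the global environment is acceptable for the object in question (the global environment is saved as a reference, not by content).

```r
# Hypothetical helper: size in bytes of the file that save() writes
# for a single object.
saved_size <- function(obj) {
  path <- tempfile(fileext = ".RData")
  on.exit(unlink(path))          # remove the temporary file afterwards
  save(obj, file = path)
  file.info(path)$size
}

# A formula that drags a large local vector along in its environment.
f0 <- function() {
  junk <- rnorm(1e5)
  y ~ x
}
heavy <- f0()

# The same formula, pointed at the global environment instead.
lean <- heavy
environment(lean) <- globalenv()

print(object.size(heavy))   # object.size() ignores the environment
print(object.size(lean))
print(saved_size(heavy))    # the saved file includes 'junk'
print(saved_size(lean))     # the saved file does not
```

The two object.size() figures give no hint of the difference, while the file sizes do.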

Duncan Murdoch

> Bill Venables.
>
> -----Original Message-----
> From: Venables, Bill (CMIS, Cleveland)
> Sent: Sunday, 11 July 2010 11:40 AM
> To: 'Duncan Murdoch'; 'Paul Johnson'
> Cc: 'r-devel_at_r-project.org'; Taylor, Julian (CMIS, Waite Campus)
> Subject: RE: [Rd] Large discrepancies in the same object being saved to .RData
>
> I'm still a bit puzzled by the original question. I don't think it
> has much to do with .RData files and their sizes. For me the puzzle
> comes much earlier. Here is an example of what I mean using a little
> session
>
>
>> usedVcells <- function() gc()["Vcells", "used"]
>> usedVcells() ### the base load
>>
> [1] 96345
>
> ### Now look at what happens when a function returns a formula as the
> ### value, with a big item floating around in the function closure:
>
>
>> f0 <- function() {
>>
> + junk <- rnorm(10000000)
> + y ~ x
> + }
>
>> v0 <- f0()
>> usedVcells() ### much bigger than base, why?
>>
> [1] 10096355
>
>> v0 ### no obvious environment
>>
> y ~ x
>
>> object.size(v0) ### so far, no clue given where
>>
> ### the extra Vcells are located.
> 372 bytes
>
> ### Does v0 have an enclosing environment?
>
>
>> environment(v0) ### yep.
>>
> <environment: 0x021cc538>
>
>> ls(envir = environment(v0)) ### as expected, there's the junk
>>
> [1] "junk"
>
>> rm(junk, envir = environment(v0)) ### this does the trick.
>> usedVcells()
>>
> [1] 96355
>
> ### Now consider a second example where the object
> ### is not a formula, but contains one.
>
>
>> f1 <- function() {
>>
> + junk <- rnorm(10000000)
> + x <- 1:3
> + y <- rnorm(3)
> + lm(y ~ x)
> + }
>
>
>> v1 <- f1()
>> usedVcells() ### as might have been expected.
>>
> [1] 10096455
>
> ### in this case, though, there is no
> ### (obvious) enclosing environment
>
>
>> environment(v1)
>>
> NULL
>
>> object.size(v1) ### so where are the junk Vcells located?
>>
> 7744 bytes
>
>> ls(envir = environment(v1)) ### clearly will not work
>>
> Error in ls(envir = environment(v1)) : invalid 'envir' argument
>
>
>> rm(v1) ### removing the object does clear out the junk.
>> usedVcells()
>>
> [1] 96366
>
>
> And in this second case, as noted by Julian Taylor, if you save() the
> object the .RData file is also huge. There is an environment attached
> to the object somewhere, but it appears to be occluded and entirely
> inaccessible. (I have poked around the object components trying to
> find the thing but without success.)
>
> Have I missed something?
>
> Bill Venables.
>
> -----Original Message-----
> From: r-devel-bounces_at_r-project.org [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Duncan Murdoch
> Sent: Sunday, 11 July 2010 10:36 AM
> To: Paul Johnson
> Cc: r-devel_at_r-project.org
> Subject: Re: [Rd] Large discrepancies in the same object being saved to .RData
>
> On 10/07/2010 2:33 PM, Paul Johnson wrote:
>
>> On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch <murdoch.duncan_at_gmail.com> wrote:
>>
>>
>>> On 06/07/2010 9:04 PM, Julian.Taylor_at_csiro.au wrote:
>>>
>>>
>>>> Hi developers,
>>>>
>>>>
>>>>
>>>> After some investigation I have found there can be large discrepancies in
>>>> the same object being saved as an external "xx.RData" file. The immediate
>>>> repercussion of this is the possible increased size of your .RData workspace
>>>> for no apparent reason.
>>>>
>>>>
>>>>
>>>>
>>>>
>>> I haven't worked through your example, but in general the way that local
>>> objects get captured is when part of the return value includes an
>>> environment.
>>>
>>>
>> Hi, can I ask a follow up question?
>>
>> Is there a tool to browse *.Rdata files without loading them into R?
>>
>>
>
> I don't know of one. You can load the whole file into an empty
> environment, but then you lose information about "where did it come from"?
>
> Duncan Murdoch
>
>> In HDF5 (a data storage format we use sometimes), there is a CLI
>> program "h5dump" that will spit out line-by-line all the contents of a
>> storage entity. It will literally track through all the metadata, all
>> the vectors of scores, etc. I've found that handy to "see what's
>> really in there" in cases like the one that OP asked about.
>> Sometimes, we find that there are things that are "in there" by
>> mistake, as Duncan describes, and then we can try to figure why they
>> are in there.
>>
>> pj
>>
>>
>>
>>
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



Received on Sun 11 Jul 2010 - 04:15:19 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 11 Jul 2010 - 15:15:13 GMT.
