Re: [Rd] Large discrepancies in the same object being saved to .RData

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Sun, 11 Jul 2010 18:30:20 +0100 (BST)

On Sun, 11 Jul 2010, Tony Plate wrote:

> Another way of seeing the environments referenced in an object is using
> str(), e.g.:
>
>> f1 <- function() {
> + junk <- rnorm(10000000)
> + x <- 1:3
> + y <- rnorm(3)
> + lm(y ~ x)
> + }
>> v1 <- f1()
>> object.size(f1)
> 1636 bytes
>> grep("Environment", capture.output(str(v1)), value=TRUE)
> [1] " .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "
> [2] " .. .. ..- attr(*, \".Environment\")=<environment: 0x01f11a30> "

That shows only 'some of the environments in a few cases': remember that environments have enclosing environments (and so on), and that namespaces and packages are also environments. So we also need to know about the enclosure of environment(v1$terms), which gets saved too (either as a reference or as a full environment, depending on what it is).

And the str() approach does not work for many of the commonest cases:

> f <- function() {
+ x <- pi
+ g <- function() print(x)
+ return(g)
+ }
> g <- f()
> str(g)
function ()

In fact I think it works only for formulae.
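
For a closure such as g above, the captured environment can still be reached directly even though str() does not list it. The following is only a minimal sketch of that inspection; the variable names e, visible and captured_x are introduced here for illustration and are not from the thread.

```r
f <- function() {
  x <- pi
  g <- function() print(x)
  return(g)
}
g <- f()

e <- environment(g)                 # the frame of the f() call that created g
visible <- ls(e)                    # names that travel with the closure
captured_x <- get("x", envir = e)   # value of x as seen by g

print(visible)
print(captured_x)
```

Anything else created inside f() before it returned would appear in ls(e) as well, and would be kept alive (and saved) along with g.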

> -- Tony Plate
>
> On 7/10/2010 10:10 PM, Bill.Venables@csiro.au wrote:
>> Well, I have answered one of my questions below. The hidden
>> environment is attached to the 'terms' component of v1.

Well, not really hidden. A terms component is a formula (see ?terms.object), and a formula has an environment just as a closure does. In neither case does the print() method tell you about it -- but ?formula does.
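
As a rough sketch of how one might look at the environment carried by the terms component, and at the chain of enclosures behind it: env_chain() below is a hypothetical helper written for this illustration, and junk is made much smaller than in the thread so the example runs quickly.

```r
# Hypothetical helper: names of an environment and all its enclosures,
# innermost first. Anonymous function-call frames have the name "".
env_chain <- function(env) {
  out <- character()
  repeat {
    out <- c(out, environmentName(env))
    if (identical(env, emptyenv())) break
    env <- parent.env(env)
  }
  out
}

f1 <- function() {
  junk <- rnorm(1000)   # stand-in for the thread's rnorm(10000000)
  x <- 1:3
  y <- rnorm(3)
  lm(y ~ x)
}
v1 <- f1()

tt_env <- environment(v1$terms)   # same object as attr(v1$terms, ".Environment")
same_env <- identical(tt_env, attr(v1$terms, ".Environment"))
contents <- ls(tt_env)            # includes "junk"
chain <- env_chain(tt_env)        # f1's frame, then where f1 was defined, and so on

print(contents)
print(chain)
```

The first entry of the chain is the anonymous frame of the f1() call; what follows depends on where f1 was defined and which packages are attached.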

>> To see this
>>
>>
>>> lapply(v1, environment)
>>>
>> $coefficients
>> NULL
>>
>> $residuals
>> NULL
>>
>> $effects
>> NULL
>>
>> $rank
>> NULL
>>
>> $fitted.values
>> NULL
>>
>> $assign
>> NULL
>>
>> $qr
>> NULL
>>
>> $df.residual
>> NULL
>>
>> $xlevels
>> NULL
>>
>> $call
>> NULL
>>
>> $terms
>> <environment: 0x021b9e18>
>>
>> $model
>> NULL
>>
>>
>>> rm(junk, envir = with(v1, environment(terms)))
>>> usedVcells()
>>>
>> [1] 96532
>>
>>>
>>>
>> This is still a bit of a trap for young (and old!) players...
>>
>> I think the main point in my mind is why is it that object.size()
>> excludes enclosing environments in its reckonings?
>>
>> Bill Venables.
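
One way to see the gap between object.size() and what would actually be written out is to compare it with the length of the serialized object. This is a sketch under the assumption that the uncompressed serialize() output approximates what save() has to store; the names reported, serialized and serialized_after are introduced here, and junk is smaller than in the thread.

```r
f0 <- function() {
  junk <- rnorm(1e5)   # about 800 kB of doubles; the thread used 1e7
  y ~ x
}
v0 <- f0()

reported   <- as.numeric(object.size(v0))   # does not count the environment
serialized <- length(serialize(v0, NULL))   # includes the environment's contents

# Dropping junk from the formula's environment shrinks the serialized form.
rm(junk, envir = environment(v0))
serialized_after <- length(serialize(v0, NULL))

print(c(reported = reported,
        serialized = serialized,
        serialized_after = serialized_after))
```
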
>>
>> -----Original Message-----
>> From: Venables, Bill (CMIS, Cleveland)
>> Sent: Sunday, 11 July 2010 11:40 AM
>> To: 'Duncan Murdoch'; 'Paul Johnson'
>> Cc: 'r-devel_at_r-project.org'; Taylor, Julian (CMIS, Waite Campus)
>> Subject: RE: [Rd] Large discrepancies in the same object being saved to
>> .RData
>>
>> I'm still a bit puzzled by the original question. I don't think it
>> has much to do with .RData files and their sizes. For me the puzzle
>> comes much earlier. Here is an example of what I mean using a little
>> session
>>
>>
>>> usedVcells<- function() gc()["Vcells", "used"]
>>> usedVcells() ### the base load
>>>
>> [1] 96345
>>
>> ### Now look at what happens when a function returns a formula as the
>> ### value, with a big item floating around in the function closure:
>>
>>
>>> f0<- function() {
>>>
>> + junk<- rnorm(10000000)
>> + y ~ x
>> + }
>>
>>> v0<- f0()
>>> usedVcells() ### much bigger than base, why?
>>>
>> [1] 10096355
>>
>>> v0 ### no obvious environment
>>>
>> y ~ x
>>
>>> object.size(v0) ### so far, no clue given where
>>>
>> ### the extra Vcells are located.
>> 372 bytes
>>
>> ### Does v0 have an enclosing environment?
>>
>>
>>> environment(v0) ### yep.
>>>
>> <environment: 0x021cc538>
>>
>>> ls(envir = environment(v0)) ### as expected, there's the junk
>>>
>> [1] "junk"
>>
>>> rm(junk, envir = environment(v0)) ### this does the trick.
>>> usedVcells()
>>>
>> [1] 96355
>>
>> ### Now consider a second example where the object
>> ### is not a formula, but contains one.
>>
>>
>>> f1<- function() {
>>>
>> + junk<- rnorm(10000000)
>> + x<- 1:3
>> + y<- rnorm(3)
>> + lm(y ~ x)
>> + }
>>
>>
>>> v1<- f1()
>>> usedVcells() ### as might have been expected.
>>>
>> [1] 10096455
>>
>> ### in this case, though, there is no
>> ### (obvious) enclosing environment
>>
>>
>>> environment(v1)
>>>
>> NULL
>>
>>> object.size(v1) ### so where are the junk Vcells located?
>>>
>> 7744 bytes
>>
>>> ls(envir = environment(v1)) ### clearly will not work
>>>
>> Error in ls(envir = environment(v1)) : invalid 'envir' argument
>>
>>
>>> rm(v1) ### removing the object does clear out the junk.
>>> usedVcells()
>>>
>> [1] 96366
>>
>>>
>> And in this second case, as noted by Julian Taylor, if you save() the
>> object the .RData file is also huge. There is an environment attached
>> to the object somewhere, but it appears to be occluded and entirely
>> inaccessible. (I have poked around the object components trying to
>> find the thing but without success.)
>>
>> Have I missed something?
>>
>> Bill Venables.
>>
>> -----Original Message-----
>> From: r-devel-bounces_at_r-project.org [mailto:r-devel-bounces_at_r-project.org]
>> On Behalf Of Duncan Murdoch
>> Sent: Sunday, 11 July 2010 10:36 AM
>> To: Paul Johnson
>> Cc: r-devel_at_r-project.org
>> Subject: Re: [Rd] Large discrepancies in the same object being saved to
>> .RData
>>
>> On 10/07/2010 2:33 PM, Paul Johnson wrote:
>>
>>> On Wed, Jul 7, 2010 at 7:12 AM, Duncan Murdoch<murdoch.duncan_at_gmail.com>
>>> wrote:
>>>
>>>
>>>> On 06/07/2010 9:04 PM, Julian.Taylor_at_csiro.au wrote:
>>>>
>>>>
>>>>> Hi developers,
>>>>>
>>>>>
>>>>>
>>>>> After some investigation I have found there can be large discrepancies
>>>>> in the same object being saved as an external "xx.RData" file. The
>>>>> immediate repercussion of this is the possible increased size of your
>>>>> .RData workspace for no apparent reason.
>>>>>
>>>> I haven't worked through your example, but in general the way that local
>>>> objects get captured is when part of the return value includes an
>>>> environment.
>>>>
>>>>
>>> Hi, can I ask a follow up question?
>>>
>>> Is there a tool to browse *.Rdata files without loading them into R?
>>>
>>>
>> I don't know of one. You can load the whole file into an empty
>> environment, but then you lose information about "where did it come from"?
>>
>> Duncan Murdoch
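
Loading into a fresh environment, as suggested above, might look roughly like the following. The file is a temporary one created just for this sketch, and the objects a and b and the per-object size summary are illustrative assumptions rather than anything from the thread.

```r
tmp <- tempfile(fileext = ".RData")
a <- 1:10
b <- y ~ x
save(a, b, file = tmp)

e <- new.env()
loaded <- load(tmp, envir = e)   # load() returns the names it restored

# Serialized length of each restored object, as a crude size indicator
# that (unlike object.size) follows attached environments.
sizes <- sapply(loaded, function(nm) length(serialize(get(nm, envir = e), NULL)))
print(sizes)

unlink(tmp)
```

This keeps the current workspace untouched, but the file is still read completely into memory.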
>>
>>> In HDF5 (a data storage format we use sometimes), there is a CLI
>>> program "h5dump" that will spit out line-by-line all the contents of a
>>> storage entity. It will literally track through all the metadata, all
>>> the vectors of scores, etc. I've found that handy to "see what's
>>> really in there" in cases like the one that OP asked about.
>>> Sometimes, we find that there are things that are "in there" by
>>> mistake, as Duncan describes, and then we can try to figure why they
>>> are in there.
>>>
>>> pj
>>>
>>>
>>>
>>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sun 11 Jul 2010 - 17:33:12 GMT