[Rd] update() problem (was: saving objects with embedded environments)

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Tue, 03 Jul 2007 11:39:16 -0400

I don't want this thread to be lost, so I've changed the subject heading.

I think there's a problem here, but the problem is that update() is failing, not that the environment is being unnecessarily saved. The problem here is that update() has no method specific to lm objects, so it doesn't know about the saved environment. I don't know what other kinds of objects update.default can handle properly, but if they don't all have terms, there should probably be a specific update.lm method that knows to look there.

Now it's not completely obvious what update() should do in general, e.g. when adding terms. If a user asks to add a term Foo to a model, they may mean the Foo that is in their workspace, not the Foo that happens to be in the data frame but not previously included in the model. However, I think in the case of an lm object produced in a function, the updating should be done in the environment of the function.

Duncan Murdoch

On 7/2/2007 9:27 AM, McGehee, Robert wrote:

> Thanks for this. So at the risk of treading out too deep into unfamiliar
> water, one concern is that if I run 'lm' within a function and then the
> function exits, am I still (perhaps unnecessarily) keeping a copy of the
> function environment and the associated data? It does not seem that
> 'update' even works after I exit the function, so it's not clear to me
> what help saving the environment and a copy of all of its data is (see
> example below).
> 

>> B <- data.frame(y=1:100, x=rnorm(100))
>> FUN <- function(B) lm(y ~ x, data=B)
>> m <- FUN(B)
>> rm(B)
> ## update doesn't find object 'B' in function environment (so why store
> the environment?)

>> update(m, y ~ 1)
> Error in inherits(x, "data.frame") : object "B" not found
> 
> ## However, there is a copy of object 'B' saved anyway, even
> ## after removing it from the global environment and exiting the
> function

>> dim(get("B", envir=attr(m$terms, ".Environment")))
> [1] 100   2
> 
> For my purposes all works well now. I brought this up only as one can
> quickly run out of memory if data is unnecessarily kept around after
> with large models. Anecdotally, before isolating this issue I would
> crash my snow/MPI session (followed by R) when trying to transfer these
> 'lm' and 'lm'-like objects with embedded environments. If I did not
> distribute the processing, then I found that I would rather quickly use
> up all 24GB of my computer's memory and swap space after repeated calls.
> 
> Thanks,
> Robert
> 
> -----Original Message-----
> From: Roger Peng [mailto:rdpeng_at_gmail.com] 
> Sent: Friday, June 29, 2007 7:44 PM
> To: McGehee, Robert
> Cc: R-devel
> Subject: Re: [Rd] saving objects with embedded environments
> 
> I believe this is intentional.  See ?serialize.  When lm() is called
> in a function, the environment is saved in case the resulting fitted
> model object needs to be updated, for example, with update().
> 
> if you don't want the linear model object, you might try just saving
> the relevant objects to a separate list rather than try to delete
> everything that is irrelevant from the 'lm' object.
> 
> -roger
> 
> On 6/28/07, McGehee, Robert <Robert.McGehee_at_geodecapital.com> wrote:

>> Hello,
>> I have been running linear regressions on large data sets. As 'lm'
> saves

>> a great deal of extraneous (for me) data including the residuals,
>> fitted.values, model frame, etc., I generally set these to NULL within
>> the object before saving off the model to a file.
>>
>> In the below example, however, I have found that depending on whether
> or

>> not I run 'lm' within another function or not, the entire function
>> environment is saved off with the file. So, even while object.size and
>> all.equal report that both 'lm's are equal and of small size, one
> saves

>> as a 24MB file and the other as 646 bytes. These seems to be because
> in

>> the first example the function environment is saved in attr(x1$terms,
>> ".Environment") and takes up all 24MB of space.
>>
>> Anyway, I think this is a bug, or if nothing else very undesirable
> (that

>> an object reported to be 0.5kb takes up 24MB). There also seems to be
>> some inconsistency on how environments are saved depending on if it is
>> the global environment or not, though I'm not familiar enough with
>> environments to know if this was intentional. Comments are
> appreciated.

>>
>> Thanks,
>> Robert
>>
>> ##################################################################
>> testEq <- function(B) {
>> x <- lm(y ~ x1+x2+x3, data=B, model=FALSE)
>> x$residuals <- x$effects <- x$fitted.values <- x$qr$qr <- NULL
>> x
>> }
>>
>> N <- 900000
>> B <- data.frame(y=rnorm(N)+1:N, x1=rnorm(N)+1:N, x2=rnorm(N)+1:N,
>> x3=rnorm(N)+1:N)
>> x1 <- testEq(B)
>> x2 <- lm(y ~ x1+x2+x3, data=B, model=FALSE)
>> x2$residuals <- x2$effects <- x2$fitted.values <- x2$qr$qr <- NULL
>>
>> all.equal(x1, x2) ## TRUE
>> object.size(x1) ## 5112
>> object.size(x2) ## 5112
>> save(x1, file="x1.RData")
>> save(x2, file="x2.RData")
>> file.info("x1.RData")$size ## 24063852 bytes
>> file.info("x2.RData")$size ## 646 bytes
>>
>> > R.version
>> _
>> platform i686-pc-linux-gnu
>> arch i686
>> os linux-gnu
>> system i686, linux-gnu
>> status
>> major 2
>> minor 5.0
>> year 2007
>> month 04
>> day 23
>> svn rev 41293
>> language R
>> version.string R version 2.5.0 (2007-04-23)
>>
>>
>> Robert McGehee, CFA
>> Quantitative Analyst
>> Geode Capital Management, LLC
>> One Post Office Square, 28th Floor | Boston, MA | 02109
>> Tel: 617/392-8396 Fax:617/476-6389
>> mailto:robert.mcgehee_at_geodecapital.com
>>
>>
>>
>> This e-mail, and any attachments hereto, are intended for
> us...{{dropped}}

>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue 03 Jul 2007 - 16:33:53 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 03 Jul 2007 - 16:35:58 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.