Re: [R] More compact form of lm object that can be used for prediction?

From: Marc Schwartz <marc_schwartz_at_comcast.net>
Date: Fri, 11 Jul 2008 15:54:06 -0500

on 07/11/2008 02:02 PM Woolner, Keith wrote:

>> From: Marc Schwartz [mailto:marc_schwartz_at_comcast.net]
>> Sent: Friday, July 11, 2008 12:14 PM
>>
>> on 07/11/2008 10:50 AM Woolner, Keith wrote:
>>> Hi everyone,
>>>
>>> Is there a way to take an lm() model and strip it to a minimal form (or
>>> convert it to another type of object) that can still be used to predict
>>> the dependent variable?
>> <snip>
>>
>> Depending upon how much memory you need to conserve and what else you
>> may need to do with the model object:
>>
>> 1. lm(YourFormula, data = YourData, model = FALSE)
>>
>> 'model = FALSE' will result in the model frame not being retained.
>>
>> 2. lm(YourFormula, data = YourData, model = FALSE, x = FALSE)
>>
>> 'x = FALSE' will result in the model matrix not being retained.
>>
>> See ?lm for more information.

>
> Marc,
>
> Thank you for the suggestions. Though I neglected to mention it, I had
> already consulted ?lm and was using model=FALSE. x=FALSE is the default
> setting and I had left it unchanged.
>
> The problem I still face is that the memory usage is dominated by the
> "qr" component of the model, consuming nearly 80% of the total
> footprint. Using model=FALSE and x=FALSE saves a little over 4% of
> model size, and if I deliberately clobber some other components, as
> shown below, I can boost that to about 20% savings while still
> being able to use predict().
>
> lm.1$fitted.values <- NULL
> lm.1$residuals <- NULL
> lm.1$weights <- NULL
> lm.1$effects <- NULL
>
> The lm() object after doing so is still around 52 megabytes
> (object.size(lm.1) = 51,611,888), with 99.98% of it being used by
> lm.1$qr. That was the motivation behind my original question, which was
> whether there's a way to get predictions from a model without keeping
> the "qr" component around. Especially since I want to create and use
> six of these models simultaneously.
>
> My hope is to save and deploy the models in a reporting system to
> generate predictions on a daily basis as new data comes in, while the
> model itself would change only infrequently. Hence, I am more concerned
> with being able to retain the predictive portion of the models in a
> concise format, and less concerned with keeping the supporting
> analytical detail around for this application.
>
> The answer may be that what I'm seeking to do isn't possible with the
> currently available R+packages, although I'd be mildly surprised if
> others haven't run into this situation before. I just wanted to make
> sure I wasn't missing something obvious.
>
> Many thanks,
> Keith
>

I was hoping that you might save more (or at least 'enough') memory by not including the model frame and matrix.

From what I can tell from a quick review of the code, without the 'qr' component of the lm model object, you will lose the ability to use the predict.lm() function. In fact, you even lose the ability to use summary.lm(), among other related functions.
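
A quick toy check of that, using a built-in dataset (my own example here, not one of your models):

## Toy illustration: removing 'qr' breaks predict.lm()
fit <- lm(dist ~ speed, data = cars)
fit$qr <- NULL
predict(fit, data.frame(speed = 10:12))
## fails with an error about the missing 'qr' component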

If the only thing that you need to do is to use the final models to run predictions on new data, all you really need are the resultant coefficients from the source models, together with the correct encoding, contrasts and any transforms of the IVs; you can then code your program around those parameters.

If the models are not going to change 'too frequently' (a relative term to be sure), I would not worry about spending a lot of time automating the process. Once the basic framework is in place, you can easily hard code the mechanics and update them as the models do change.
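
For instance, here is a minimal sketch of that idea (the object name 'my.lm' and the file name are my own choices): persist only the pieces that prediction needs and discard the rest.

## Keep only what prediction requires: the terms object records the
## formula and variable roles, contrasts/xlevels record any factor
## encodings, and coef() holds the fitted parameters.
keep <- list(terms     = delete.response(terms(my.lm)),
             coef      = coef(my.lm),
             contrasts = my.lm$contrasts,
             xlevels   = my.lm$xlevels)

## A terms/formula object drags its environment along when saved;
## re-pointing it at the global environment (which is serialized as
## a reference, not by value) keeps the saved file small.
environment(keep$terms) <- globalenv()

save(keep, file = "my_lm_keep.RData")   # typically KB rather than MB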

A possibility would be to create a design matrix from the new incoming data and then use matrix multiplication against the coefficients to generate the predictions.

For example, using the very simplistic model from ?predict.lm, we get:

x <- rnorm(15)
y <- x + rnorm(15)

my.lm <- lm(y ~ x)

my.coef <- coef(my.lm)

> my.coef
(Intercept)           x
  -0.232839    1.455494

# Create some new 'x' data for prediction
new <- data.frame(x = seq(-3, 3, 0.5))

# Create a design matrix from the new data
my.mm <- model.matrix(~ x, new)

# Now create the predicted 'y' values using the new 'x' data
> my.mm %*% my.coef
         [,1]
1  -4.599321
2  -3.871574
3  -3.143827
4  -2.416080
5  -1.688333
6  -0.960586
7  -0.232839
8   0.494908
9   1.222655
10  1.950402
11  2.678149
12  3.405896
13  4.133643

# How does that compare with using predict.lm()?
> predict(my.lm, new)
        1         2         3         4         5         6         7
-4.599321 -3.871574 -3.143827 -2.416080 -1.688333 -0.960586 -0.232839
        8         9        10        11        12        13
 0.494908  1.222655  1.950402  2.678149  3.405896  4.133643

See ?model.matrix for more information.
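
The same idea extends to models with factor predictors, where new data must be encoded exactly as the training data was. A sketch, reusing the 'keep' list saved above ('newdata' stands in for a data frame of incoming observations; the xlev and contrasts.arg arguments are documented in ?model.matrix):

## Rebuild predictions from the saved pieces alone, without the
## original (large) lm object in memory.
load("my_lm_keep.RData")   # restores the list 'keep'

## Build the design matrix using the stored factor levels and
## contrasts so its columns line up with the coefficients.
## Assumes a full-rank fit (no NA coefficients).
my.mm <- model.matrix(keep$terms, data = newdata,
                      contrasts.arg = keep$contrasts,
                      xlev          = keep$xlevels)

pred <- drop(my.mm %*% keep$coef)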

An alternative, of course, would be to move to a 64-bit platform, under which you would have access to a much larger RAM footprint.

HTH, Marc



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
