Re: [R] Estimate of variance and prediction for multiple linear regression

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Thu, 24 Jun 2010 08:16:16 +0100

On Wed, 2010-06-23 at 16:30 -0700, cc super wrote:
> What if the size of the newdata is different from the previous one
> used to generate the regression model?
>
> Let's say
>
> pdat <- data.frame(x1 = rnorm(5, 2), x2 = rnorm(5))
> predict(lin, pdat)
>
> It comes up with warning and the result is not correct.

No it doesn't (not on my system at least) and you give no indication what the warning/error is or why you think the result is incorrect.

set.seed(1234)
y <- rnorm(10, mean = 5)
dat <- data.frame(x1 = rnorm(10, mean = 2),

                  x2 = rnorm(10))

lin <- lm(y~x1+x2, data = dat)
pdat <- data.frame(x1 = rnorm(5, 2), x2 = rnorm(5)) predict(lin, pdat)

> predict(lin, pdat)

       1 2 3 4 5 4.971021 6.482085 5.742130 4.652898 5.194980

The size of newdata shouldn't matter at all:

> pdat2 <- data.frame(x1 = rnorm(1, 2), x2 = rnorm(1))
> predict(lin, pdat2)

       1
4.185984
> pdat3 <- data.frame(x1 = rnorm(1000, 2), x2 = rnorm(1000))
> head((p3 <- predict(lin, pdat3)))

       1 2 3 4 5 6 4.076389 4.118338 4.054443 3.976399 4.055590 4.126018
> length(p3)

[1] 1000

So can you show the exact code you used plus the errors/warnings you report and why you think the results are incorrect?

For example, automating your by-hand computation a bit shows that predict() produces the correct result. Compare:

> ## The cbind(1, ....) bit is for the intercept...
> data.matrix(cbind(1, pdat)) %*% coef(lin)

         [,1]

[1,] 4.178516
[2,] 4.101112
[3,] 4.081370
[4,] 4.146819
[5,] 4.186405

> predict(lin, pdat)
1 2 3 4 5
4.178516 4.101112 4.081370 4.146819 4.186405

HTH G

>
> Thanks!
>
>
> 2010/6/23 Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
> On Tue, 2010-06-22 at 23:11 -0700, cc super wrote:
> > Hi, everyone,
> >
> > Night. I have three questions about multiple linear
> regression in R.
> >
> > Q1:
> >
> > y=rnorm(10,mean=5)
> > x1=rnorm(10,mean=2)
> > x2=rnorm(10)
> > lin=lm(y~x1+x2)
> > summary(lin)
> >
> > ## In the summary, 'Residual standard error: 1.017 on 7
> degrees of freedom',
> > 1.017 is the estimate of the constance variance?
>
>
> Yes, it is sigma.
>
> Just a note, in order for the above code to yield the same
> results as
> you quote, you need a call to set.seed() to fix the pseudo
> random number
> generator.
>
> > Q2:
> >
> > beta0=lin$coefficients[1]
> > beta1=lin$coefficients[2]
> > beta2=lin$coefficients[3]
> >
> > y_hat=beta0+beta1*x1+beta2*x2
> >
> > ## Is there any built-in function in R to obtain y_hat
> directly?
>
>
> fitted(lin)
>
> Note that there are quite a few standard extractor functions
> like fitted
> available for modelling functions in R. coef() for example
> should be
> used to extract the coefficients, resid() will extract
> residuals etc.
>
> > Q3:
> >
> > If I want to apply this regression result to another
> dataset, that is, new
> > x1 and x2. Is the built-in function in 2 still available?
>
>
> It is called predict() (although if you called predict(lin)
> above
> instead of fitted(lin) it would have produced the same answer;
> the
> fitted values for the observations).
>
> One gotcha that catches people out is that in the new dataset,
> the
> variables (used in the model) must have the same names as the
> data frame
> used to fit it. So we could do:
>
> pdat <- data.frame(x1 = rnorm(10, 2), x2 = rnorm(10))
> predict(lin, pdat)
>
> to get predictions at the new values of x1 an x2.
>
> > Thank you in advance!
>
> HTH
>
> G
>
> --
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~
> %~%~%~%
> Dr. Gavin Simpson [t] +44 (0)20 7679 0522
> ECRC, UCL Geography, [f] +44 (0)20 7679 0565
> Pearson Building, [e]
> gavin.simpsonATNOSPAMucl.ac.uk
> Gower Street, London [w]
> http://www.ucl.ac.uk/~ucfagls/
> UK. WC1E 6BT. [w]
> http://www.freshwaters.org.uk
> %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~
> %~%~%~%
>
>

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 24 Jun 2010 - 07:20:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 24 Jun 2010 - 07:30:34 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive