Re: [R] R2 always increases as variables are added?

From: 李俊杰 <klijunjie_at_gmail.com>
Date: Tue, 22 May 2007 12:08:45 +0800

Hi, Lynch,

First, thank you for your attention.

I am also not a statistician and have taken only a few statistics classes, so it is natural for us to ask questions that seem naive to statisticians.

I am sorry, but I cannot agree with your point that we must always include an intercept in our model, because if the true intercept is zero, your strategy (or your textbook's) incurs two losses. First, there is an interpretation problem: if the true intercept is zero and your estimate of it is not, the regression result is misleading. This may be less serious than judging coefficients that are actually zero to be nonzero, but the misjudgment is still a loss to some extent. Second, if the true intercept is zero, the predictive ability of your strategy is often lower than that of strategies which do not always include an intercept.

If you are interested in the performance of your strategy, e.g. maximizing adjusted R^2 always with an intercept, you can run the code I put in the attachment.
It shows that maximizing adjusted R^2 without always including an intercept beats maximizing adjusted R^2 always with an intercept.
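
The idea can be sketched roughly as follows. This is not the attached code, only a simpler illustration under assumed settings: the true model y = 2*x + noise (zero intercept), the sample sizes, the seed, and the helper name sim_once are all hypothetical. It fits the model with and without an intercept on a small training sample and compares the mean squared prediction error on fresh data.

```r
# Minimal sketch (assumed settings): the true intercept is zero, y = 2*x + noise.
set.seed(1)

sim_once <- function(n_train = 20, n_test = 1000, beta = 2) {
  train <- data.frame(x = rnorm(n_train))
  train$y <- beta * train$x + rnorm(n_train)
  test <- data.frame(x = rnorm(n_test))
  test$y <- beta * test$x + rnorm(n_test)

  fit_with    <- lm(y ~ x, data = train)      # intercept is estimated
  fit_without <- lm(y ~ x - 1, data = train)  # intercept fixed at zero

  # Out-of-sample mean squared prediction error of each fit
  c(with_intercept    = mean((test$y - predict(fit_with, test))^2),
    without_intercept = mean((test$y - predict(fit_without, test))^2))
}

# 2 x 500 matrix: one column per simulated data set
mse <- replicate(500, sim_once())
rowMeans(mse)
```

How large the difference is depends on the sample size and the noise level, and if the true intercept is not zero, the fit without an intercept is biased.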

Junjie

2007/5/22, Paul Lynch <plynchnlm_at_gmail.com>:
>
> Junjie,
> First, a disclaimer: I am not a statistician, and have only taken
> one statistics class, but I just took it this Spring, so the concepts
> of linear regression are relatively fresh in my head and hopefully I
> will not be too inaccurate.
> According to my statistics textbook, when selecting variables for
> a model, the intercept term is always present. The "variables" under
> consideration do not include the constant "1" that multiplies the
> intercept term. I don't think it makes sense to compare models with
> and without an intercept term. (Also, I don't know what the point of
> using a model without an intercept term would be, but that is probably
> just my ignorance.)
> Similarly, the formula you were using for R**2 seems to only be
> useful in the context of a standard linear regression (i.e., one that
> includes an intercept term). As your example shows, it is easy to
> construct a "fit" (e.g. y = 10,000,000*x) so that SSR > SST if one is
> not deriving the fit from the regular linear regression process.
> --Paul
>
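
A quick check of this point, as a sketch with made-up data (the seed, the helper name ss, and the variable names are assumptions). With an intercept, the total sum of squares about the mean splits exactly into SSR + SSE, so SSR/SST equals 1 - SSE/SST and lies between 0 and 1. Without an intercept the residuals need not sum to zero, the split fails, and SSR/SST is no longer bounded by 1.

```r
set.seed(42)
x <- rnorm(10)
y <- runif(10)

# Sums of squares about mean(y) for a fitted model
ss <- function(fit, y) {
  y_hat <- fitted(fit)
  c(SST = sum((y - mean(y))^2),
    SSR = sum((y_hat - mean(y))^2),
    SSE = sum((y - y_hat)^2))
}

with_int    <- ss(lm(y ~ x), y)
without_int <- ss(lm(y ~ x - 1), y)

# With an intercept: SST = SSR + SSE holds up to rounding error.
all.equal(unname(with_int["SST"]), unname(with_int["SSR"] + with_int["SSE"]))

# Without an intercept: the gap is generally nonzero.
gap <- unname(without_int["SST"] - (without_int["SSR"] + without_int["SSE"]))
gap
```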
> On 5/19/07, 李俊杰 <klijunjie_at_gmail.com> wrote:
> > I know that "-1" removes the intercept term. But my question is why the
> > intercept term CANNOT be treated as a variable, given that we place a
> > column consisting of 1s in the predictor matrix.
> >
> > If I insist on comparing a model with an intercept and one without on
> > adjusted R2, I now think the strategy is always to use another
> > definition of R-square or adjusted R-square, in which
> > r-square=sum((y.hat)^2)/sum((y)^2).
> >
> > Am I on the right track?
> >
> > Thanks
> >
> > Li Junjie
> >
> >
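
For reference, a sketch of how this definition relates to what R reports (simulated data; the seed and variable names are assumptions). For a model without an intercept, summary.lm computes R-squared against the uncentered total sum(y^2) rather than sum((y - mean(y))^2), which agrees with the definition sum((y.hat)^2)/sum((y)^2).

```r
set.seed(7)
x <- matrix(rnorm(20), ncol = 2)
y <- drop(x %*% c(1, 2) + rnorm(10))

fit   <- lm(y ~ x - 1)   # no intercept
y_hat <- fitted(fit)

r2_uncentered <- sum(y_hat^2) / sum(y^2)   # definition proposed above
r2_summary    <- summary(fit)$r.squared    # value reported by R

all.equal(r2_uncentered, r2_summary)
```

This is also why lm(y ~ xx - 1) with a hand-made column of ones reports a different R-squared than lm(y ~ x), even though the fitted values are identical: the formula has no intercept term, so the uncentered total is used.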
> > 2007/5/19, Paul Lynch <plynchnlm_at_gmail.com>:
> > > In case you weren't aware, the meaning of the "-1" in y ~ x - 1 is to
> > > remove the intercept term that would otherwise be implied.
> > > --Paul
> > >
> > > On 5/17/07, 李俊杰 <klijunjie_at_gmail.com> wrote:
> > > > Hi, everybody,
> > > >
> > > > 3 questions about R-square:
> > > > ---------(1)----------- Does R2 always increase as variables are added?
> > > > ---------(2)----------- Can R2 be greater than 1?
> > > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > > calculated? It is different from
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > >
> > > > I will illustrate these problems with the following code:
> > > > ---------(1)----------- R2 doesn't always increase as variables are added
> > > >
> > > > > x=matrix(rnorm(20),ncol=2)
> > > > > y=rnorm(10)
> > > > >
> > > > > lm=lm(y~1)
> > > > > y.hat=rep(1*lm$coefficients,length(y))
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 2.646815e-33
> > > > >
> > > > > lm=lm(y~x-1)
> > > > > y.hat=x%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.4443356
> > > > >
> > > > > ################ This is the biggest model, but its R2 is not the biggest, why?
> > > > > lm=lm(y~x)
> > > > > y.hat=cbind(rep(1,length(y)),x)%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.2704789
> > > >
> > > >
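
A sketch of the same three fits scored with 1 - RSS/SST instead (simulated data; the seed and names are assumptions). The residual sum of squares can never increase when a least-squares model is enlarged, so under this definition the largest model always scores at least as high as the models nested in it, although the value for the fit without an intercept can be negative.

```r
set.seed(3)
x <- matrix(rnorm(20), ncol = 2)
y <- rnorm(10)

sst <- sum((y - mean(y))^2)

# deviance() of an lm fit is its residual sum of squares
rss <- c(null         = deviance(lm(y ~ 1)),
         no_intercept = deviance(lm(y ~ x - 1)),
         full         = deviance(lm(y ~ x)))

1 - rss / sst
```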
> > > > ---------(2)----------- R2 can be greater than 1
> > > >
> > > > > x=rnorm(10)
> > > > > y=runif(10)
> > > > > lm=lm(y~x-1)
> > > > > y.hat=x*lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 3.513865
> > > >
> > > >
> > > > ---------(3)----------- How is R2 in summary(lm(y~x-1))$r.squared
> > > > calculated? It is different from
> > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > > x=matrix(rnorm(20),ncol=2)
> > > > > xx=cbind(rep(1,10),x)
> > > > > y=x%*%c(1,2)+rnorm(10)
> > > > > ### r2 calculated by lm(y~x)
> > > > > lm=lm(y~x)
> > > > > summary(lm)$r.squared
> > > > [1] 0.9231062
> > > > > ### r2 calculated by lm(y~xx-1)
> > > > > lm=lm(y~xx-1)
> > > > > summary(lm)$r.squared
> > > > [1] 0.9365253
> > > > > ### r2 calculated by me
> > > > > y.hat=xx%*%lm$coefficients
> > > > > (r.square=sum((y.hat-mean(y))^2)/sum((y-mean(y))^2))
> > > > [1] 0.9231062
> > > >
> > > >
> > > > Thanks a lot for any clue :)
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Junjie Li, klijunjie_at_gmail.com
> > > > Undergraduate in DEP of Tsinghua University,
> > > >
> > > > [[alternative HTML version deleted]]
> > > >
> > > > ______________________________________________
> > > > R-help_at_stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > >
> > >
> > >
> > > --
> > > Paul Lynch
> > > Aquilent, Inc.
> > > National Library of Medicine (Contractor)
> > >
> >
> >
> >
> > --
> >
> > Junjie Li, klijunjie_at_gmail.com
> > Undergraduate in DEP of Tsinghua University,
>
>
> --
> Paul Lynch
> Aquilent, Inc.
> National Library of Medicine (Contractor)
>

-- 
Junjie Li,                  klijunjie_at_gmail.com
Undergraduate in DEP of Tsinghua University,

Received on Tue 22 May 2007 - 04:14:39 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 May 2007 - 08:31:41 GMT.