# Re: [R] Question about variable selection

From: John Fox <jfox_at_mcmaster.ca>
Date: Sun 19 Feb 2006 - 07:22:57 EST

Dear Wensui,

I don't think that it's possible to answer these questions mechanically, especially if you're interested in the "true" relationship between the response and a set of explanatory variables. If, however, you have a pure prediction problem, then variable selection is a more reasonable approach, as long as it's done carefully (in my opinion).

I don't see how resampling and repeatedly examining the marginal relationship between Y and an X is relevant to the question of whether there is a partial relationship in the absence of a marginal relationship. (This is close to what Wittgenstein once called buying two copies of the same newspaper to see whether what was said in the first one is true.) After all, as I said (and as you understand), the partial and marginal relationships can differ -- so evidence about the marginal relationship is not necessarily relevant to inference about the partial relationship. (As well, bootstrapping a linear least-squares regression likely isn't going to give you much additional information anyway.)
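
A minimal sketch of the partial-versus-marginal point, using simulated data that are not from this thread. The variable names (x1, x2, y), the seed, and the coefficients are all assumptions chosen for illustration: x1 and x2 are correlated at about 0.5, and y depends on both with no interaction, in such a way that the population covariance between y and x1 is zero.

```r
# Hypothetical simulation (not from the thread): two correlated
# explanatory variables and a response with main effects only.
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + sqrt(1 - 0.5^2) * rnorm(n)  # cor(x1, x2) is about 0.5
y  <- x1 - 2 * x2 + rnorm(n)                 # population cov(y, x1) = 1 - 2 * 0.5 = 0

marginal <- lm(y ~ x1)        # marginal (bivariate) regression of y on x1
partial  <- lm(y ~ x1 + x2)   # partial regression, no interaction term

coef(marginal)["x1"]  # close to 0 in expectation
coef(partial)["x1"]   # close to 1 in expectation
```

Under these assumptions, resampling the rows and refitting `lm(y ~ x1)` would keep returning a slope near zero, which says nothing about the coefficient of x1 once x2 is in the model.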

Regards,
John

John Fox
Department of Sociology
McMaster University
Hamilton, Ontario
905-525-9140x23604
http://socserv.mcmaster.ca/jfox

> -----Original Message-----
> From: r-help-bounces@stat.math.ethz.ch
> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Wensui Liu
> Sent: Saturday, February 18, 2006 3:03 PM
> To: John Fox
> Cc: r-help@stat.math.ethz.ch
> Subject: Re: [R] Question about variable selection
>
> Dear John,
>
> I fully understand your point that an IV might not be
> significantly correlated with DV in the bivariate situation but
> might be significantly correlated with DV in the presence
> of other IVs. But does this significant partial relationship
> reflect the true relation between IV and DV and really help
> to predict DV?
>
> From here, let's go one step further. If I do multiple
> resampling from the original dataset, build bivariate LMs
> between IV and DV with different samples, and still can't get
> a significant result, do you think I should give this IV a
> chance by looking at its partial relationship with DV?
>
> Thank you so much!
>
> On 2/18/06, John Fox <jfox@mcmaster.ca> wrote:
> >
> > Dear Wensui and Andy,
> >
> > When the explanatory variables are correlated it's perfectly
> > possible for the marginal relationship between an X and Y to be
> > zero and a partial relationship nonzero (even in the absence of
> > interactions) -- this is simply a reflection of the more general
> > point that partial and marginal relationships can differ.
> >
> > Regards,
> > John
> >
> > --------------------------------
> > John Fox
> > Department of Sociology
> > McMaster University
> > Hamilton, Ontario
> > 905-525-9140x23604
> > http://socserv.mcmaster.ca/jfox
> > --------------------------------
> >
> > > -----Original Message-----
> > > From: r-help-bounces@stat.math.ethz.ch
> > > [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Wensui Liu
> > > Sent: Saturday, February 18, 2006 2:03 PM
> > > To: Liaw, Andy
> > > Cc: r-help@stat.math.ethz.ch
> > > Subject: Re: [R] Question about variable selection
> > >
> > >
> > > But what if I am only interested in main effects instead of
> > > interactions?
> > >
> > >
> > >
> > > On 2/18/06, Liaw, Andy <andy_liaw@merck.com> wrote:
> > > >
> > > > That depends on whether the IV could have some significant
> > > > interactions with other IVs not considered in the bivariate
> > > > analysis. E.g.,
> > > >
> > > > > iv <- expand.grid(-2:2, -2:2)
> > > > > y <- 3 + iv[,1] * iv[,2] + rnorm(nrow(iv), sd=0.1)
> > > > > summary(lm(y ~ iv[,1]))
> > > >
> > > > Call:
> > > > lm(formula = y ~ iv[, 1])
> > > >
> > > > Residuals:
> > > >      Min       1Q   Median       3Q      Max
> > > > -4.06259 -1.06048 -0.02377  1.05901  4.04315
> > > >
> > > > Coefficients:
> > > >             Estimate Std. Error t value Pr(>|t|)
> > > > (Intercept)  3.01908    0.41482   7.278 2.09e-07 ***
> > > > iv[, 1]      0.01417    0.29332   0.048    0.962
> > > > ---
> > > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> > > >
> > > > Residual standard error: 2.074 on 23 degrees of freedom
> > > > Multiple R-Squared: 0.0001014, Adjusted R-squared: -0.04337
> > > > F-statistic: 0.002333 on 1 and 23 DF, p-value: 0.9619
> > > >
> > > > > summary(lm(y ~ iv[,1] * iv[,2]))
> > > >
> > > > Call:
> > > > lm(formula = y ~ iv[, 1] * iv[, 2])
> > > >
> > > > Residuals:
> > > >      Min       1Q   Median       3Q      Max
> > > > -0.22390 -0.08894 -0.01279  0.13525  0.17608
> > > >
> > > > Coefficients:
> > > >                  Estimate Std. Error t value Pr(>|t|)
> > > > (Intercept)      3.019083   0.026330 114.665   <2e-16 ***
> > > > iv[, 1]          0.014167   0.018618   0.761    0.455
> > > > iv[, 2]         -0.005486   0.018618  -0.295    0.771
> > > > iv[, 1]:iv[, 2]  0.992865   0.013165  75.418   <2e-16 ***
> > > > ---
> > > > Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> > > >
> > > > Residual standard error: 0.1316 on 21 degrees of freedom
> > > > Multiple R-Squared: 0.9963, Adjusted R-squared: 0.9958
> > > > F-statistic: 1896 on 3 and 21 DF, p-value: < 2.2e-16
> > > >
> > > >
> > > >
> > > >
> > > > Andy
> > > >
> > > > From: Wensui Liu
> > > > >
> > > > > Dear Lister,
> > > > >
> > > > > I have a question about variable selection for regression.
> > > > >
> > > > > > If the IV is not significantly related to DV in the bivariate
> > > > > > analysis, does it make sense to include this IV in the
> > > > > > full model with multiple IVs?
> > > > >
> > > > > Thank you so much!
> > > > >
> > > > > ______________________________________________
> > > > > R-help@stat.math.ethz.ch mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > http://www.R-project.org/posting-guide.html
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> --------------------------------------------------------------------
> > > --
> > > > --------
> > > > Notice: This e-mail message, together with any
> > > > attachment...{{dropped}}
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > http://www.R-project.org/posting-guide.html
> >
> >
>
>
> --
> WenSui Liu
> (http://statcompute.blogspot.com)
> Senior Decision Support Analyst
> Health Policy and Clinical Effectiveness Cincinnati Children
> Hospital Medical Center
>