# Re: [R] Question about variable selection

From: William Revelle <lists_at_revelle.net>
Date: Sun 19 Feb 2006 - 09:22:13 EST

Dear Wensui,

What you are asking about is what psychologists call a "suppressor" variable: a predictor that is unrelated to the criterion but correlated with the other predictors (X1 in the following example). Although it has a zero correlation with the DV, it does "really" help to predict the DV by removing extraneous variance from the other IVs. (I am not going to touch the Wittgenstein issue of truth here.) Should it be included in the predictor set? Yes. Is there any easy way to find all possible suppressors? No.
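As a rough sketch of why a suppressor helps, the population regression weights can be worked out directly from a correlation matrix with base R. The correlations here are assumed for illustration (X1 and Y uncorrelated, the other two correlations .5), and the names `R`, `rxx`, `rxy`, `beta`, `R2.both` and `R2.x2` are only illustrative.

```r
# Assumed population correlations among X1, X2 and Y (illustration only)
R <- matrix(c( 1, .5,  0,
              .5,  1, .5,
               0, .5,  1), ncol = 3,
            dimnames = list(c("X1", "X2", "Y"), c("X1", "X2", "Y")))

rxx <- R[c("X1", "X2"), c("X1", "X2")]  # correlations among the predictors
rxy <- R[c("X1", "X2"), "Y"]            # predictor-criterion correlations

beta    <- solve(rxx, rxy)   # standardized regression weights: rxx %*% beta = rxy
R2.both <- sum(beta * rxy)   # squared multiple correlation using X1 and X2
R2.x2   <- rxy["X2"]^2       # squared correlation using X2 alone

round(beta, 3)
round(c(both = R2.both, X2.only = unname(R2.x2)), 3)
```

Under these assumed correlations X1 receives a weight of -1/3 even though it is uncorrelated with Y, and the squared multiple correlation rises from .25 (X2 alone) to 1/3 (X1 and X2 together).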

Consider the following:

#demonstration of "suppressor effects"
library(mvtnorm)
sigma <- matrix(c(1,.5,0,.5,1,.5,0,.5,1),ncol=3)
my.data <- data.frame(rmvnorm(1000,sigma=sigma))
names(my.data) <- c("X1", "X2", "Y")
round(cor(my.data),2)
summary(lm(Y~ X1 + X2,data= my.data))

which produces

```
      X1   X2     Y
X1  1.00 0.45 -0.04
X2  0.45 1.00  0.51
Y  -0.04 0.51  1.00
```

```
> summary(lm(Y~ X1 + X2,data= my.data))

Call:
lm(formula = Y ~ X1 + X2, data = my.data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.09350 -0.58069  0.02280  0.53436  3.02017

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02807    0.02557   1.098    0.273
X1          -0.32849    0.02813 -11.680   <2e-16 ***
X2           0.65666    0.02861  22.951   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8081 on 997 degrees of freedom
Multiple R-Squared: 0.3465,	Adjusted R-squared: 0.3452
F-statistic: 264.4 on 2 and 997 DF,  p-value: < 2.2e-16
```

At 3:22 PM -0500 2/18/06, John Fox wrote:

>Dear Wensui,
>
>I don't think that it's possible to answer these questions mechanically,
>especially if you're interested in the "true" relationship between the
>response and a set of explanatory variables. If, however, you have a pure
>prediction problem, then variable selection is a more reasonable approach,
>as long as it's done carefully (in my opinion).
>
>I don't see how resampling and repeatedly examining the marginal
>relationship between Y and an X is relevant to the question of whether there
>is a partial relationship in the absence of a marginal relationship. (This
>is close to what Wittgenstein once called buying two copies of the same
>newspaper to see whether what was said in the first one is true.) After all,
>as I said (and as you understand), the partial and marginal relationship can
>differ -- so evidence about the marginal relationship is not necessarily
>relevant to inference about the partial relationship. (As well,
>bootstrapping a linear least-squares regression likely isn't going to give
>
>Regards,
>  John
>
>--------------------------------
>John Fox
>Department of Sociology

.... (discussion of interaction from Andy Liaw)

>  > > > From: Wensui Liu
>  > > > >
>  > > > > Dear Lister,
>  > > > >
>  > > > > I have a question about variable selection for regression.
>  > > > >
>  > > > > if the IV is not significantly related to DV in the bivariate
>  > > > > analysis, does it make sense to include this IV into the full
>  > > > > model with multiple IVs?
>  > > > >
>  > > > > Thank you so much!

--
William Revelle		http://pmc.psych.northwestern.edu/revelle.html
Professor			http://personality-project.org/personality.html
Department of Psychology       http://www.wcas.northwestern.edu/psych/
Northwestern University	http://www.northwestern.edu/

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help