Re: [R] stepAIC and polynomial terms

From: <Bill.Venables_at_csiro.au>
Date: Mon, 17 Mar 2008 14:01:41 +1000

There is absolutely no reason to remove age altogether. Notice that typically age and age^2 are highly correlated. To see this, consider 100 people with mean age 35 and 95% tolerance limite between 15 and 55:

> age <- rnorm(100, 35, 10)
> cor(age, age^2)
[1] 0.9898186

So if you use raw age and I(age^2) as predictors, it's really just the luck of the draw which gets selected (usually), and they will do much the same job when it comes to prediction, of course.

So what are McCullagh and Nelder on about? One way to look at it is as a policy issue. In a mathematical sense you would think that whether you used age as the predictor or (age - 35) ("years away from mid-life") should not make any difference, and if in you model selection procedure it does make a difference, then something arbitrary is going on, and any arbitrariness in this game is often a precursor of trouble to come. Consider the correlations again:

> cor(age-35, (age-35)^2)

[1] -0.1302315

One way to *encourage* the linear term to be chosen ahead of the quadratic term is, in fact, to mean correct the predictor:

sAge <- age - mean(age)

and use sAge and I(sAge^2) as your predictors. I expect this will favour the linear term over the quadratic and you will be led to a model that has no quadratic term, even if, in a strictly mathematical sense, the starting models were entirely equivalent. (Beware if you do this, though, you make things difficult when it comes to prediction.)

You draw attention to a bit of a gap in the software, in my view. In variable selection with functions line stepAIC you would like to be able to specify a set of marginality constraints (to use the McCullagh and Nelder term) that you would like the model sifting process to respect, in order to ensure invariance with respect to groups of transformations that are natural to the problem. In this case you would like to declare that 1 is marginal to age which in turn is marginal to (age^2), to ensure invariance with respect to the action of the location and scale group, as seems natural. Why should changing the origin and unit of measurement have any consequences for the model selection process? Notice that in the case of factors and interactions this happens already: main effect terms will not be dropped if interactions involving them are still present. It's a similar argument. The same feature, ideally, should be available for other cases where marginality issues are at stake, but doing that seems to be a tricky problem. Using it might be trickier still. People would have to think about group invariance properties and that's foreign to most people...

Bill Venables.

-----Original Message-----
From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of caspar
Sent: Monday, 17 March 2008 11:50 AM
To: r-help_at_r-project.org
Subject: [R] stepAIC and polynomial terms

Dear all,
I have a question regarding the use of stepAIC and polynomial (quadratic to be specific) terms in a binary logistic regression model. I read in McCullagh and Nelder, (1989, p 89) and as far as I remember from my statistics cources, higher-degree polynomial effects should not be included without the main effects. If I understand this correctly, following a stepwise model selection based on AIC should not lead to a model where the main effect of some continuous covariate is removed, but the quadratic term is kept.
The question is, should I keep the quadratic term (note, there are other main effects that were retained following the stepwise algorithm) in the final model or should I delete it as well and move on? Or should I retain the main effect as well?

To picture it, the initial model to which I called stepAIC is:

Call: glm(formula = S ~ FR + Date * age + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

and the final one:

Call: glm(formula = S ~ FR + Date + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

Thanks very much in advance for your thoughts and suggestions,

Caspar

Caspar Hallmann
MSc Student WUR
The Netherlands

        [[alternative HTML version deleted]]



R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Mon 17 Mar 2008 - 04:11:24 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 17 Mar 2008 - 04:30:22 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive