From: <Bill.Venables_at_csiro.au>

Date: Mon, 17 Mar 2008 14:01:41 +1000

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Mon 17 Mar 2008 - 04:11:24 GMT


There is absolutely no reason to remove age altogether. Notice that typically age and age^2 are highly correlated. To see this, consider 100 people with mean age 35 and 95% tolerance limits between 15 and 55:

> age <- rnorm(100, 35, 10)

> cor(age, age^2)

[1] 0.9898186

So if you use raw age and I(age^2) as predictors, it's really just the luck of the draw which gets selected (usually), and they will do much the same job when it comes to prediction, of course.
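To make the "much the same job" point concrete, here is a minimal sketch under stated assumptions: the response y, its coefficients and the seed are invented purely for illustration and are not from the original question. It fits age alone and age^2 alone and compares the two fits.

```r
set.seed(3)                                  # arbitrary seed, for reproducibility only
age <- rnorm(100, 35, 10)
y   <- 1 + 0.3 * age + rnorm(100, sd = 2)    # hypothetical response, linear in age

fitLin <- lm(y ~ age)                        # linear term only
fitSq  <- lm(y ~ I(age^2))                   # quadratic term only

# R-squared of the two single-term models, side by side
c(linear = summary(fitLin)$r.squared, squared = summary(fitSq)$r.squared)

# The two sets of fitted values are affine functions of age and age^2,
# so their correlation equals cor(age, age^2) up to sign
cor(fitted(fitLin), fitted(fitSq))
```

With predictors this strongly correlated, which of the two a stepwise procedure keeps can hinge on small differences in fit.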

So what are McCullagh and Nelder on about? One way to look at it is as a policy issue. In a mathematical sense you would think that whether you used age as the predictor or (age - 35) ("years away from mid-life") should not make any difference. If in your model selection procedure it does make a difference, then something arbitrary is going on, and any arbitrariness in this game is often a precursor of trouble to come. Consider the correlations again:

> cor(age-35, (age-35)^2)

[1] -0.1302315

One way to *encourage* the linear term to be chosen ahead of the quadratic term is, in fact, to mean correct the predictor:

sAge <- age - mean(age)

and use sAge and I(sAge^2) as your predictors. I expect this will favour the linear term over the quadratic and you will be led to a model that has no quadratic term, even though, in a strictly mathematical sense, the starting models were entirely equivalent. (Beware, though: if you do this you make things more difficult when it comes to prediction, since new ages must be shifted by the same mean(age) that was used in the fit.)
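A small sketch of both remarks, assuming a made-up response y (the data, the seed and the names ageCentre, fitRaw and fitCentre are illustrative only): the full quadratic models in raw and centred age give the same fit, and prediction from the centred model works only if new ages are shifted by the stored centring constant.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
age <- rnorm(100, 35, 10)
y   <- 2 + 0.5 * age + rnorm(100)    # hypothetical response, linear in age

ageCentre <- mean(age)               # keep the centring constant for later
sAge      <- age - ageCentre

fitRaw    <- lm(y ~ age  + I(age^2))
fitCentre <- lm(y ~ sAge + I(sAge^2))

# Both design matrices span the same space, so the fitted values agree
all.equal(fitted(fitRaw), fitted(fitCentre))

# Prediction: shift new ages by the *stored* mean, not by their own mean
newAge     <- c(20, 40, 60)
predCentre <- predict(fitCentre, newdata = data.frame(sAge = newAge - ageCentre))
predRaw    <- predict(fitRaw,    newdata = data.frame(age  = newAge))
all.equal(predCentre, predRaw)
```

The equivalence holds only while both terms are in the model; once a selection step drops one of them, the raw and centred versions are no longer the same model.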

You draw attention to a bit of a gap in the software, in my view. In variable selection with functions like stepAIC you would like to be able to specify a set of marginality constraints (to use the McCullagh and Nelder term) that you would like the model sifting process to respect, in order to ensure invariance with respect to groups of transformations that are natural to the problem. In this case you would like to declare that 1 is marginal to age which in turn is marginal to (age^2), to ensure invariance with respect to the action of the location and scale group, as seems natural. Why should changing the origin and unit of measurement have any consequences for the model selection process? Notice that in the case of factors and interactions this happens already: main effect terms will not be dropped if interactions involving them are still present. It's a similar argument. The same feature, ideally, should be available for other cases where marginality issues are at stake, but doing that seems to be a tricky problem. Using it might be trickier still. People would have to think about group invariance properties and that's foreign to most people...
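The contrast between interactions and polynomial terms can be seen with drop1() from the stats package, which lists the terms eligible for deletion. This is only a sketch: the factors f and g, the pure-noise response and the seed are assumptions made for illustration. The last few lines show one crude workaround, not a real marginality constraint: the lower component of the scope argument of stepAIC keeps age in every candidate model, whether or not I(age^2) survives.

```r
set.seed(2)                               # arbitrary seed, for reproducibility only
n   <- 120
f   <- gl(2, 60, labels = c("a", "b"))    # hypothetical two-level factor
g   <- gl(3, 20, length = n)              # hypothetical three-level factor
age <- rnorm(n, 35, 10)
y   <- rnorm(n)                           # pure-noise response, illustration only

fitInt  <- lm(y ~ f * g)
fitQuad <- lm(y ~ age + I(age^2))

# With an interaction present, only f:g is offered for deletion
rownames(drop1(fitInt))

# No such protection here: both age and I(age^2) are offered
rownames(drop1(fitQuad))

# Crude workaround (assumes the MASS package is installed):
# force age to stay in every model visited by stepAIC
if (requireNamespace("MASS", quietly = TRUE)) {
  sel <- MASS::stepAIC(fitQuad, scope = list(lower = ~ age), trace = FALSE)
  print(formula(sel))
}
```

Fixing the lower scope protects age unconditionally, which is stronger than marginality strictly requires, but it does stop the selection from ending at a model with I(age^2) and no linear term.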

To picture it, the initial model to which I called stepAIC is:

Thanks very much in advance for your thoughts and suggestions,

Caspar Hallmann

MSc Student WUR

The Netherlands



Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Mon 17 Mar 2008 - 04:30:22 GMT.
