[R] Model validation and penalization with rms package

From: Mark Seeto <markseeto_at_gmail.com>
Date: Tue, 29 Jun 2010 11:31:04 +1000


Iíve been using Frank Harrellís rms package to do bootstrap model validation. Is it the case that the optimum penalization may still give a model which is substantially overfitted?

I calculated corrected R^2, optimism in R^2, and corrected slope for various penalties for a simple example:

x1 <- rnorm(45)
x2 <- rnorm(45)
x3 <- rnorm(45)

y <- x1 + 2*x2 + rnorm(45,0,3)

ols0 <- ols(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)

corrected.Rsq <- rep(0,60)
optimism.Rsq <- rep(0,60)
corrected.slope <- rep(0,60)

for (pen in 1:60) {
olspen <- ols(y ~ x1 + x2 + x3, penalty=pen, x=TRUE, y=TRUE) val <- validate(olspen, B=200)
corrected.Rsq[pen] <- val["R-square", "index.corrected"] optimism.Rsq[pen] <- val["R-square", "optimism"] corrected.slope[pen] <- val["Slope", "index.corrected"] }
plot(corrected.Rsq)
x11(); plot(optimism.Rsq)
x11(); plot(corrected.slope)
p <- pentrace(ols0, penalty=1:60)
ols9 <- ols(y ~ x1 + x2 + x3, penalty=9, x=TRUE, y=TRUE) validate(ols9, B=200)

 		index.orig  training       test           optimism    index.corrected   n
R-square	0.2523404 0.2820734  0.1911878  0.09088563       0.1614548 200
MSE 	7.8497722 7.3525300  8.4918212 -1.13929116       8.9890634 200
Intercept   0.0000000 0.0000000 -0.1353572  0.13535717      -0.1353572 200
Slope       1.0000000 1.0000000  1.1707137 -0.17071372       1.1707137 200

pentrace tells me that of the penalties 1, 2,..., 60, corrected AIC is maximised by a penalty of 9. This is consistent with the corrected R^2 plot, which shows a maximum somewhere around 10. However, a penalty of 9 still gives an R^2 optimism of 0.09 (training R^2=0.28, test R^2=0.19), suggesting overfitting.

Do we just have to live with this R^2 optimism? It can be decreased by taking a bigger penalty, but then the corrected R^2 is reduced. Also, a penalty of 9 gives a corrected slope of about 1.17 (corrected slope of 1 is achieved with a penalty of about 1 or 2).

Thanks for any help/advice you can give.

Mark

--
Mark Seeto
Statistician

National Acoustic Laboratories
A Division of Australian Hearing

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 29 Jun 2010 - 01:34:58 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 29 Jun 2010 - 04:50:42 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive