From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>

Date: Thu 03 Mar 2005 - 10:23:49 EST


Christian Hennig wrote:

> Perhaps I should not write it because I will discredit myself with this
> but...
>
> Suppose I have a setup with 100 variables and some 1000 cases and I want to
> boil down the number of variables to a maximum of 10 for practical reasons
> even if I lose 10% prediction quality by this (for example because it is
> expensive to measure all variables on new cases).
>
> Is it really so wrong to use a stepwise method?

Yes. Read about model uncertainty and bias in models developed using stepwise methods. One exception: if there is a large number of variables with truly zero regression coefficients, and the rest are not very weak, stepwise can sort things out fairly well. But you never know this in advance.
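A toy simulation makes the selection bias concrete (an illustrative sketch, in self-contained Python rather than R; the sample sizes and the `abs_corr` helper are arbitrary choices, not anything from this thread): with many pure-noise candidate predictors, the single "best" one by fit looks convincingly strong.

```python
# Sketch: why the "winning" variable from a selection procedure is biased.
import random
import statistics

random.seed(1)
n, p = 100, 50                      # cases, candidate predictors
y = [random.gauss(0, 1) for _ in range(n)]          # outcome is pure noise
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]

def abs_corr(x, y):
    """Absolute Pearson correlation, written out to stay dependency-free."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return abs(cov / (sx * sy))

corrs = [abs_corr(x, y) for x in X]
typical = statistics.median(corrs)  # what an honest, pre-specified predictor shows
best = max(corrs)                   # what selection "finds"
print(f"median |r| = {typical:.3f}, selected |r| = {best:.3f}")
# The winning predictor looks much stronger than a typical one even though
# every true coefficient is exactly zero: its coefficient, test statistic,
# and apparent predictive value are all biased away from the truth.
```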

> Let's say I divide the sample into three parts and do variable selection on
> the first part, estimation on the second and test on the third part (this
> solves almost all problems Frank is talking about on p. 56/57 in his
> excellent book). Is there always a tractable alternative?

That's a good way to find out how bad the method is, not to fix the problems inherent in it.
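The three-way split Christian describes is trivial to set up (illustrative sketch; the size and names are arbitrary). The held-out third then measures the damage honestly, which is the point above: it quantifies the problem rather than repairing it.

```python
# Sketch: partition cases into selection / estimation / test thirds.
import random

random.seed(4)
n = 1000
idx = list(range(n))
random.shuffle(idx)
select_set = idx[: n // 3]            # used only to pick variables
estimate_set = idx[n // 3 : 2 * n // 3]  # used only to fit coefficients
test_set = idx[2 * n // 3 :]          # used only to assess performance
print(len(select_set), len(estimate_set), len(test_set))
```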


> Of course it is wrong to interpret the selected variables as "the true
> influences" and all others as "unrelated", but if I don't do that?
>
> If it should really be a taboo to do stepwise variable selection, why are p.
> 58/59 of "Regression Modeling Strategies" devoted to "how to do it if you
> must"?

Stress on "if". And note that if you ask what is the optimum alpha for variables to be kept in the model when doing backwards stepdown, it's alpha=1.0. A good compromise is alpha=0.5. See

@Article{ste01pro,
  author  = {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and Harrell, Frank E. and Habbema, J. Dik F.},
  title   = {Prognostic modeling with logistic regression analysis: {In} search of a sensible strategy in small data sets},
  journal = {Medical Decision Making},
  year    = 2001,
  volume  = 21,
  pages   = {45-56},
  annote  = {shrinkage; variable selection; dichotomization of continuous variables; sign of regression coefficient; calibration; validation}
}
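Backward stepdown with a retention threshold alpha can be sketched as follows (an illustration only, not code from this thread: pure-Python least squares with a normal approximation to the t distribution, and hypothetical toy data; in R, the Design package's fastbw function does this properly).

```python
# Sketch: backward elimination -- repeatedly drop the weakest remaining
# variable until every survivor has (approximate) p-value below alpha.
import math
import random

random.seed(2)

def invert(A):
    """Gauss-Jordan inverse of a small square matrix."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def ols(X, y):
    """Least-squares coefficients and standard errors via normal equations."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(XtX)
    beta = [sum(inv[i][j] * Xty[j] for j in range(p)) for i in range(p)]
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (len(y) - p)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(p)]
    return beta, se

def backward_stepdown(X, y, alpha=0.5):
    keep = list(range(len(X[0])))
    while len(keep) > 1:
        Xk = [[r[j] for j in keep] for r in X]
        beta, se = ols(Xk, y)
        z = [abs(b) / s for b, s in zip(beta, se)]
        # two-sided p-value, normal approximation to t
        pvals = [2 * (1 - 0.5 * (1 + math.erf(zi / math.sqrt(2)))) for zi in z]
        worst = max(range(len(keep)), key=lambda i: pvals[i])
        if pvals[worst] < alpha:
            break
        keep.pop(worst)
    return keep

# Toy data: variables 0 and 1 truly matter, the other four are noise.
n = 200
X = [[random.gauss(0, 1) for _ in range(6)] for _ in range(n)]
y = [2 * r[0] - 1.5 * r[1] + random.gauss(0, 1) for r in X]
print(backward_stepdown(X, y, alpha=0.5))   # indices of retained variables
```

Note how lenient alpha=0.5 is compared with the conventional 0.05: the point of the large alpha is to delete only variables that contribute essentially nothing, rather than to certify the survivors.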

And on Bert's excellent question about why shrinkage is not used more often, here is our attempt at a remedy:

@Article{moo04pen,
  author  = {Moons, K. G. M. and Donders, A. Rogier T. and Steyerberg, E. W. and Harrell, F. E.},
  title   = {Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example},
  journal = {Journal of Clinical Epidemiology},
  year    = 2004,
  volume  = 57,
  pages   = {1262-1270},
  annote  = {prediction research; overoptimism; overfitting; penalization; bootstrapping; shrinkage}
}
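The core of the penalized idea can be sketched in a few lines (an illustration with made-up data, not the paper's method: ridge-type linear rather than penalized logistic regression, to keep it short). The penalized estimate is beta = (X'X + lambda*I)^{-1} X'y, which pulls coefficients toward zero; in R, Design's lrm(..., penalty=) and pentrace implement the logistic version.

```python
# Sketch: ridge-penalized least squares shrinks coefficients toward zero,
# which is what protects out-of-sample predictions in overfit-prone fits.
import math
import random

random.seed(3)

def invert(A):
    """Gauss-Jordan inverse of a small square matrix."""
    n = len(A)
    M = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [v / d for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def fit(X, y, lam=0.0):
    """beta = (X'X + lam*I)^{-1} X'y; lam = 0 gives ordinary least squares."""
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(XtX)
    return [sum(inv[i][j] * Xty[j] for j in range(p)) for i in range(p)]

n, p = 40, 20                       # few cases per variable: ripe for overfitting
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0.5 * r[0] + random.gauss(0, 1) for r in X]

b_ols = fit(X, y)
b_ridge = fit(X, y, lam=10.0)       # penalty strength chosen arbitrarily here;
                                    # in practice pick it by AIC or cross-validation
norm = lambda b: math.sqrt(sum(v * v for v in b))
print(f"||beta_OLS|| = {norm(b_ols):.2f}, ||beta_ridge|| = {norm(b_ridge):.2f}")
# The penalized coefficients are pulled toward zero -- that is the shrinkage.
```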

Frank


> Please forget my name;-)

>
> Christian
>
> On Wed, 2 Mar 2005, Berton Gunter wrote:
>

>> To clarify Frank's remark ...
>>
>> A prominent theme in statistical research over at least the last 25 years
>> (with roots that go back 50 or more, probably) has been the superiority of
>> "shrinkage" methods over variable selection. I also find it distressing that
>> these ideas have apparently not penetrated much (at all?) into the wider
>> scientific community (but I suppose I shouldn't be surprised -- most
>> scientists still do one factor at a time experiments 80 years after Fisher).
>> Specific incarnations can be found in anything Bayesian, mixed effects
>> models for repeated measures, ridge regression, and the R packages lars and
>> lasso, among others.
>>
>> I would speculate that aside from the usual statistics/science cultural
>> issues, part of the reason for this is that the estimators don't generally
>> come with neat, classical inference procedures: like it or not, many
>> scientists have been conditioned by their Stat 101 courses to expect P
>> values, so in some sense, we are hoisted by our own petard.
>>
>> Just my $.02 -- contrary (and more knowledgeable) opinions welcome.
>>
>> -- Bert Gunter
>>
>>> -----Original Message-----
>>> From: r-help-bounces@stat.math.ethz.ch
>>> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Frank E Harrell Jr
>>> Sent: Wednesday, March 02, 2005 5:13 AM
>>> To: Wittner, Ben
>>> Cc: r-help@lists.R-project.org
>>> Subject: Re: [R] subset selection for logistic regression
>>>
>>> Wittner, Ben wrote:
>>>
>>>> R-packages leaps and subselect implement various methods of selecting
>>>> best or good subsets of predictor variables for linear regression
>>>> models, but they do not seem to be applicable to logistic regression
>>>> models.
>>>>
>>>> Does anyone know of software for finding good subsets of predictor
>>>> variables for linear regression models?
>>>>
>>>> Thanks.
>>>>
>>>> -Ben
>>>
>>> Why are these procedures still being used? The performance is known to
>>> be bad in almost every sense (see r-help archives).
>>>
>>> Frank Harrell
>>>
>>>> p.s., The leaps package references "Subset Selection in Regression" by
>>>> Alan Miller. On page 2 of the 2nd edition of that text it states the
>>>> following:
>>>>
>>>>   "All of the models which will be considered in this monograph will be
>>>>   linear; that is they will be linear in the regression coefficients.
>>>>   Though most of the ideas and problems carry over to the fitting of
>>>>   nonlinear models and generalized linear models (particularly the
>>>>   fitting of logistic relationships), the complexity is greatly
>>>>   increased."
>>>
>>> --
>>> Frank E Harrell Jr   Professor and Chair   School of Medicine
>>>                      Department of Biostatistics
>>>                      Vanderbilt University

> Christian Hennig
> Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
> hennig_at_math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
> From 1 April 2005: Department of Statistical Science, UCL, London
> #######################################################################

--
Frank E Harrell Jr   Professor and Chair   School of Medicine
                     Department of Biostatistics
                     Vanderbilt University

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Thu Mar 03 10:54:14 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:39 EST