Re: [R] subset selection for logistic regression

From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>
Date: Thu 03 Mar 2005 - 10:23:49 EST

Christian Hennig wrote:
> Perhaps I should not write it because I will discredit myself with this
> but...
>
> Suppose I have a setup with 100 variables and some 1000 cases and I want to
> boil down the number of variables to a maximum of 10 for practical reasons
> even if I lose 10% prediction quality by this (for example because it is
> expensive to measure all variables on new cases).
>
> Is it really so wrong to use a stepwise method?

Yes. Read about model uncertainty and bias in models developed using stepwise methods. One exception: if there is a large number of variables with truly zero regression coefficients, and the rest are not very weak, stepwise can sort things out fairly well. But you never know this in advance.

> Let's say I divide the sample into three parts and do variable selection on
> the first part, estimation on the second and test on the third part (this
> solves almost all problems Frank is talking about on p. 56/57 in his
> excellent book). Is there always a tractable alternative?

That's a good way to find out how bad the method is, not to fix the problems inherent in it.
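For concreteness, the three-way split described in the question might be set up as in the sketch below. The function name, the seed and the equal-thirds partition are assumptions made for illustration, not something specified in the thread. It only builds the partition; it gives an honest look at how the selected model performs on fresh cases, and does nothing to make the selection itself more stable.

```python
import random

def three_way_split(n_cases, seed=0):
    """Randomly partition case indices into selection, estimation and test parts."""
    rng = random.Random(seed)
    idx = list(range(n_cases))
    rng.shuffle(idx)
    third = n_cases // 3
    selection = idx[:third]             # used only to choose variables
    estimation = idx[third:2 * third]   # used only to estimate coefficients
    test = idx[2 * third:]              # used only to assess predictions; absorbs the remainder
    return selection, estimation, test

sel, est, tst = three_way_split(1000)
print(len(sel), len(est), len(tst))  # 333 333 334
```

Repeating the split with different seeds and comparing which variables get selected each time is one simple way to see the model uncertainty mentioned above.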

>
> Of course it is wrong to interpret the selected variables as "the true
> influences" and all others as "unrelated", but if I don't do that?
>
> If it should really be a taboo to do stepwise variable selection, why are p.
> 58/59 of "Regression Modeling Strategies" devoted to "how to do it if you
> must"?

Stress on "if". And note that if you ask what is the optimum alpha for variables to be kept in the model when doing backwards stepdown, it's alpha=1.0. A good compromise is alpha=0.5. See

@Article{ste01pro,
   author =  {Steyerberg, Ewout W. and Eijkemans, Marinus J. C. and Harrell, Frank E. and Habbema, J. Dik F.},
   title =   {Prognostic modeling with logistic regression analysis: {In} search of a sensible strategy in small data sets},
   journal = {Medical Decision Making},
   year =    2001,
   volume =  21,
   pages =   {45-56},
   annote =  {shrinkage; variable selection; dichotomization of continuous variables; sign of regression coefficient; calibration; validation}
}
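To make the role of alpha concrete, here is a rough sketch of the retention rule applied at a single step. It is a simplification: real backwards stepdown removes one variable at a time and refits the model after each removal. The Wald z statistics and the variable names are invented for illustration.

```python
import math

def wald_p_value(z):
    """Two-sided p-value for a Wald z statistic under a standard normal reference."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def retained(z_stats, alpha):
    """Names of variables whose p-value does not exceed alpha (kept at this step)."""
    return [name for name, z in z_stats.items() if wald_p_value(z) <= alpha]

# Hypothetical Wald z statistics, invented for illustration only.
z_stats = {"x1": 3.1, "x2": 1.2, "x3": 0.4, "x4": -2.0, "x5": 0.05}

for alpha in (0.05, 0.5, 1.0):
    print(alpha, retained(z_stats, alpha))
```

With these made-up values, alpha=0.05 keeps only x1 and x4, alpha=0.5 also keeps x2, and alpha=1.0 keeps every variable, i.e. no selection is done at all.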

And on Bert's excellent question about why shrinkage is not used more often, here is our attempt at a remedy:

@Article{moo04pen,
   author =  {Moons, K. G. M. and Donders, A. Rogier T. and Steyerberg, E. W. and Harrell, F. E.},
   title =   {Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: a clinical example},
   journal = {J Clinical Epidemiology},
   year =    2004,
   volume =  57,
   pages =   {1262-1270},
   annote =  {prediction research; overoptimism; overfitting; penalization; bootstrapping; shrinkage}
}
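As a rough illustration of what shrinkage does to a logistic model, the sketch below fits the same simulated data with and without a quadratic (ridge-type) penalty on the slopes. This is a generic sketch and not the procedure from the paper above: the data, the penalty value of 20, the gradient-descent settings and the function names are all assumptions chosen for illustration, and in practice the penalty would have to be chosen by some criterion rather than fixed by hand.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_penalized_logistic(X, y, penalty, steps=500, rate=0.5):
    """Gradient descent on the mean negative log-likelihood plus
    penalty/(2n) * sum(beta_j^2); the intercept is left unpenalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = sigmoid(intercept + sum(b * x for b, x in zip(beta, xi))) - yi
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        intercept -= rate * g0 / n
        beta = [b - rate * (gj + penalty * b) / n for b, gj in zip(beta, g)]
    return intercept, beta

# Simulated data, invented for illustration: 150 cases, 3 predictors,
# the third with a true coefficient of zero.
rng = random.Random(1)
true_beta = [1.0, -0.5, 0.0]
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(150)]
y = [1 if rng.random() < sigmoid(sum(b * x for b, x in zip(true_beta, xi))) else 0
     for xi in X]

_, beta_mle = fit_penalized_logistic(X, y, penalty=0.0)
_, beta_pen = fit_penalized_logistic(X, y, penalty=20.0)
print("unpenalized:", [round(b, 3) for b in beta_mle])
print("penalized:  ", [round(b, 3) for b in beta_pen])
```

The penalized slopes are pulled toward zero relative to the unpenalized fit. All variables stay in the model; none are dropped as they would be under selection.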

Frank

>
> Please forget my name;-)
>
> Christian
>
> On Wed, 2 Mar 2005, Berton Gunter wrote:
>
>

>>To clarify Frank's remark ...
>>
>>A prominent theme in statistical research over at least the last 25 years
>>(with roots that go back 50 or more, probably) has been the superiority of
>>"shrinkage" methods over variable selection. I also find it distressing that
>>these ideas have apparently not penetrated much (at all?) into the wider
>>scientific community (but I suppose I shouldn't be surprised -- most
>>scientists still do one factor at a time experiments 80 years after Fisher).
>>Specific incarnations can be found in anything Bayesian, mixed effects
>>models for repeated measures, ridge regression, and the R packages lars and
>>lasso, among others.
>>
>>I would speculate that aside from the usual statistics/science cultural
>>issues, part of the reason for this is that the estimators don't generally
>>come with neat, classical inference procedures: like it or not, many
>>scientists have been conditioned by their Stat 101 courses to expect P
>>values, so in some sense, we are hoisted by our own petard.
>>
>>Just my $.02 -- contrary (and more knowledgeable) opinions welcome.
>>
>>-- Bert Gunter
>> 
>>
>>
>>>-----Original Message-----
>>>From: r-help-bounces@stat.math.ethz.ch 
>>>[mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Frank 
>>>E Harrell Jr
>>>Sent: Wednesday, March 02, 2005 5:13 AM
>>>To: Wittner, Ben
>>>Cc: r-help@lists.R-project.org
>>>Subject: Re: [R] subset selection for logistic regression
>>>
>>>Wittner, Ben wrote:
>>>
>>>>R-packages leaps and subselect implement various methods of selecting best or
>>>>good subsets of predictor variables for linear regression models, but they do
>>>>not seem to be applicable to logistic regression models.
>>>>
>>>>Does anyone know of software for finding good subsets of predictor variables for
>>>>linear regression models?
>>>>
>>>>Thanks.
>>>>
>>>>-Ben
>>>
>>>Why are these procedures still being used?  The performance is known to
>>>be bad in almost every sense (see r-help archives).
>>>
>>>Frank Harrell
>>>
>>>
>>>>
>>>>p.s., The leaps package references "Subset Selection in Regression" by Alan
>>>>Miller. On page 2 of the 2nd edition of that text it states the following:
>>>>
>>>>  "All of the models which will be considered in this monograph will be linear;
>>>>   that is they will be linear in the regression coefficients. Though most of
>>>>   the ideas and problems carry over to the fitting of nonlinear models and
>>>>   generalized linear models (particularly the fitting of logistic
>>>>   relationships), the complexity is greatly increased."
>>>
>>>
>>>-- 
>>>Frank E Harrell Jr   Professor and Chair           School of Medicine
>>>                      Department of Biostatistics   Vanderbilt University
>>>
>>>______________________________________________
>>>R-help@stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide! 
>>>http://www.R-project.org/posting-guide.html
>>>
>>

>
>
> ***********************************************************************
> Christian Hennig
> Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
> hennig_at_math.uni-hamburg.de, http://www.math.uni-hamburg.de/home/hennig/
> From 1 April 2005: Department of Statistical Science, UCL, London

> #######################################################################
> ich empfehle www.boag-online.de
>
>
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Received on Thu Mar 03 10:54:14 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:39 EST