Re: [R] Coefficients of Logistic Regression from bootstrap - how to get them?

From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>
Date: Mon, 21 Jul 2008 18:22:26 -0500

Michal Figurski wrote:
> Frank,
>
> "How does bootstrap improve on that?"
>
> I don't know, but I have an idea. Since the data in my set are just a
> small sample of a big population, if I use my whole dataset to
> obtain max likelihood estimates, these estimates may be best for this
> dataset, but far from ideal for the whole population.

The bootstrap, being a resampling procedure from your sample, has the same limitations with respect to the population as the MLEs.
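
To make the point about resampling concrete, here is a minimal Python sketch (standard library only) of one bootstrap draw. The data are simulated and hypothetical, and the names `sample`, `bootstrap_rows` and `resample` are illustrative assumptions, not anything from the original poster's code. It only shows that a bootstrap resample draws whole rows with replacement from the rows already observed, so it cannot contain information about the population that the sample lacks.

```python
import random

rng = random.Random(42)

# Hypothetical sample of 109 rows: (subject_id, numeric1, outcome).
# The values are simulated purely for illustration.
sample = [(i, rng.gauss(0.0, 1.0), rng.random() < 0.5) for i in range(109)]

def bootstrap_rows(rows, rng):
    """Draw len(rows) whole rows from `rows`, with replacement."""
    return [rows[rng.randrange(len(rows))] for _ in rows]

resample = bootstrap_rows(sample, rng)

# Every resampled row is one of the original rows; some rows appear
# several times and others not at all.
distinct_subjects = {row[0] for row in resample}
print(len(resample), "rows drawn,", len(distinct_subjects), "distinct subjects")
```

The resample has the same number of rows as the original sample, but never more distinct subjects than were observed in the first place.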

>
> I used bootstrap to virtually increase the size of my dataset; it should
> result in estimates closer to those from the population - isn't that
> the purpose of bootstrap?

No

>
> When I use such median coefficients on another dataset (another sample
> from the population), the predictions are better than with the max
> likelihood estimates. I have already tested that and it worked!

Then your testing procedure is probably not valid.

>
> I am not a statistician and I don't have a feel for what "overfitting" is, but it
> may be just another word for the same idea.
>
> Nevertheless, I would still like to know how I can get the coefficients
> for the model that gives the "nearly unbiased estimates". I greatly
> appreciate your help.

More info in my book Regression Modeling Strategies.
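
As a rough illustration of the use of resampling described in the quoted reply further down (a nearly unbiased estimate of model performance, i.e. a correction for overfitting), here is a minimal Python sketch of an optimism-corrected bootstrap. It is not the implementation of 'validate' in Design. The simulated data, the single predictor (the thread's model has three), the gradient-ascent fit, the Brier score as the performance measure, and all function names are assumptions made for the example. Note that the coefficients it returns are those of the fit to the whole sample; the bootstrap only adjusts the performance estimate.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, steps=200, lr=0.5):
    # Approximate maximum likelihood fit of P(y = 1) = sigmoid(a + b*x),
    # using plain gradient ascent on the mean log-likelihood.
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            r = y - sigmoid(a + b * x)
            ga += r
            gb += r * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def brier(model, xs, ys):
    # Mean squared difference between predicted probability and outcome
    # (lower is better).
    a, b = model
    return sum((sigmoid(a + b * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def optimism_corrected_brier(xs, ys, n_boot=50, seed=1):
    rng = random.Random(seed)
    n = len(xs)
    full_model = fit_logistic(xs, ys)       # coefficients from the original fit
    apparent = brier(full_model, xs, ys)    # performance on the data used to fit
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample rows
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        m = fit_logistic(bx, by)
        # How much worse the bootstrap model does on the original sample
        # than on the bootstrap sample it was fitted to.
        optimism += brier(m, xs, ys) - brier(m, bx, by)
    optimism /= n_boot
    return full_model, apparent, apparent + optimism

# Hypothetical data: 109 rows, one predictor, simulated outcome.
data_rng = random.Random(0)
xs = [data_rng.gauss(0.0, 1.0) for _ in range(109)]
ys = [1 if data_rng.random() < sigmoid(-0.5 + 1.0 * x) else 0 for x in xs]

full_model, apparent, corrected = optimism_corrected_brier(xs, ys)
print("coefficients (full-sample fit):", full_model)
print("apparent Brier score:", apparent)
print("optimism-corrected Brier score:", corrected)
```

In this sketch the bootstrap fits are used only to estimate how optimistic the apparent Brier score is; they are never averaged or otherwise combined into a new set of coefficients.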

Frank

>
> --
> Michal J. Figurski
> HUP, Pathology & Laboratory Medicine
> Xenobiotics Toxicokinetics Research Laboratory
> 3400 Spruce St. 7 Maloney
> Philadelphia, PA 19104
> tel. (215) 662-3413
>
> Frank E Harrell Jr wrote:

>> Michal Figurski wrote:
>>> Hello all,
>>>
>>> I am trying to optimize my logistic regression model by using 
>>> bootstrap. I was previously using SAS for this kind of task, but I 
>>> am now switching to R.
>>>
>>> My data frame consists of 5 columns and has 109 rows. Each row is a 
>>> single record composed of the following values: Subject_name, 
>>> numeric1, numeric2, numeric3 and outcome (yes or no). All three 
>>> numerics are used to predict outcome using LR.
>>>
>>> In SAS I wrote a macro that split the dataset, ran LR on one half 
>>> of the data and made predictions on the second half. It then 
>>> collected the equation coefficients from each iteration of the 
>>> bootstrap. Later I just took the medians of these coefficients 
>>> over all iterations and used them as an optimal model - it really 
>>> worked well!
>>
>> Why not use maximum likelihood estimation, i.e., the coefficients from 
>> the original fit?  How does the bootstrap improve on that?
>>
>>>
>>> Now I want to do the same in R. I tried to use the 'validate' or 
>>> 'calibrate' functions from package "Design", and I also experimented 
>>> with function 'sm.binomial.bootstrap' from package "sm". I also tried 
>>> the function 'boot' from package "boot", though without success - in 
>>> my case it randomly selected _columns_ from my data frame, while I 
>>> wanted it to select _rows_.
>>
>> validate and calibrate in Design resample the rows.
>>
>> Resampling is mainly used to get a nearly unbiased estimate of the 
>> model performance, i.e., to correct for overfitting.
>>
>> Frank Harrell
>>
>>>
>>> Though the main point here is the optimized LR equation. I would 
>>> appreciate any help on how to extract the LR equation coefficients 
>>> from any of these bootstrap functions, in the same form as given by 
>>> 'glm' or 'lrm'.
>>>
>>> Many thanks in advance!
>>>
>>
>>

>
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon 21 Jul 2008 - 23:26:24 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 Jul 2008 - 14:31:59 GMT.
