Re: [R] Can ROC be used as a metric for optimal model selection for randomForest?

From: Frank Harrell <f.harrell_at_vanderbilt.edu>
Date: Fri, 13 May 2011 14:29:52 -0700 (PDT)

Thanks for your note, Max. Part of the picture is how the predictions will be used. If they are used in a "forced choice" way (quite a shame, because the best decision is often no decision: get more data), things are different. If there are gray zones, or if the predicted probabilities themselves are of interest, then I'd avoid ROC area as a measure and use penalized likelihood (speaking in crude generality).
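
A minimal sketch of the distinction drawn above, in Python with made-up data (the helper names `auc`, `deviance`, `accuracy`, and `sharpen` are illustrative and not from any package discussed in this thread). ROC area and accuracy at a 0.5 cutoff depend only on how cases are ranked or which side of the cutoff they fall on, so they cannot tell a reasonable set of predicted probabilities from an overconfident one with the same ordering. Deviance can. No penalization is shown here; this only contrasts the three measures.

```python
import math

def auc(y, p):
    """ROC area: fraction of (event, non-event) pairs ranked correctly; ties count 1/2."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def deviance(y, p):
    """Binomial deviance: -2 * log-likelihood of the observed 0/1 outcomes."""
    return -2.0 * sum(math.log(pi if yi == 1 else 1.0 - pi) for yi, pi in zip(y, p))

def accuracy(y, p, cut=0.5):
    """Proportion classified correctly when predicting an event if p >= cut."""
    return sum((pi >= cut) == (yi == 1) for yi, pi in zip(y, p)) / len(y)

def sharpen(p, k):
    """Multiply the logit by k: a monotone map that keeps 0.5 fixed (hypothetical helper)."""
    odds = (p / (1.0 - p)) ** k
    return odds / (1.0 + odds)

# Made-up outcomes and predicted probabilities, for illustration only.
y = [0, 0, 1, 0, 1, 1, 0, 1]
p_moderate = [0.10, 0.30, 0.40, 0.45, 0.60, 0.70, 0.55, 0.90]

# Same ordering and same side of 0.5, but pushed hard toward 0 and 1.
p_extreme = [sharpen(pi, 10.0) for pi in p_moderate]

for name, p in [("moderate", p_moderate), ("extreme", p_extreme)]:
    print(name, "AUC:", auc(y, p), "accuracy:", accuracy(y, p),
          "deviance:", round(deviance(y, p), 2))
```

Both sets of predictions get the same ROC area and the same accuracy, while the overconfident set has a clearly larger deviance, because its two misranked cases are predicted with near certainty.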

Frank

Max Kuhn wrote:
>
> Frank,
>
> It depends on how you define "optimal". While I'm not a big fan of
> using the area under the ROC to characterize performance, there are a
> lot of times when likelihood measures are clearly sub-optimal in
> performance. Using resampled accuracy (or Kappa) instead of deviance
> (out-of-bag or not) is likely to produce more accurate models (not
> shocking, right?).
>
> The best example is determining the number of boosting iterations.
> From Friedman (2001): ``[...] degrading the likelihood by overfitting
> actually improves misclassification error rates. Although perhaps
> counterintuitive, this is not a contradiction; likelihood and error
> rate measure different aspects of fit quality.''
>
> My argument here assumes that you are fitting a model for the purposes
> of prediction rather than interpretation. This particular case
> involves random forests, so I'm hoping that statistical inference is
> not the goal.
>
>
> Ref: Friedman. Greedy function approximation: a gradient boosting
> machine. Annals of Statistics (2001) pp. 1189-1232
>
>
> Thanks,
>
> Max
>
> On Fri, May 13, 2011 at 8:11 AM, Frank Harrell
> <f.harrell_at_vanderbilt.edu> wrote:
>> Using anything other than deviance (or likelihood) as the objective
>> function will result in a suboptimal model.
>> Frank
>>
>> -----
>> Frank Harrell
>> Department of Biostatistics, Vanderbilt University
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3520043.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>



Frank Harrell
Department of Biostatistics, Vanderbilt University
--
View this message in context: http://r.789695.n4.nabble.com/Can-ROC-be-used-as-a-metric-for-optimal-model-selection-for-randomForest-tp3519003p3521274.html
Sent from the R help mailing list archive at Nabble.com.

Received on Fri 13 May 2011 - 21:34:26 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 13 May 2011 - 21:40:07 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.
