Re: [R] pseudo-R2 or GOF for regression trees?

From: Frank E Harrell Jr <f.harrell_at_vanderbilt.edu>
Date: Sat, 05 May 2007 15:52:25 -0500

Prof. Jeffrey Cardille wrote:
> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?
>
> I'm developing regression trees for a continuous y variable and I'd
> like to say how well they are doing. In particular, I'm analyzing the
> results of a simulation model having highly non-linear behavior, and
> asking what characteristics of the inputs are related to a particular
> output measure. I've got a very large number of points: n=4000. I'm
> not able to do a model sensitivity analysis because of the large
> number of inputs and the model run time.
>
> I've been googling around both on the archives and on the rest of the
> web for several hours, but I'm still having trouble getting a firm
> sense of the state of the art. Could someone help me to quickly
> understand what strategy, if any, is acceptable to say something like
> "The regression tree in Figure 3 captures 42% of the variance"? The
> target audience is readers who will be interested in the subsequent
> verbal explanation of the relationship, but only once they are
> comfortable that the tree really does capture something. I've run
> across methods to say how well a tree does relative to a set of trees
> on the same data, but that doesn't help much unless I'm sure the
> trees in question are really capturing the essence of the system.
>
> I'm happy to be pointed to a web site or to a thread I may have
> missed that answers this exact question.
>
> Thanks very much,
>
> Jeff
>
> ------------------------------------------
> Prof. Jeffrey Cardille
> jeffrey.cardille_at_umontreal.ca

Ye (below) has a method to get a nearly unbiased estimate of R^2 from recursive partitioning. In his examples the result was similar to using the formula for adjusted R^2 with regression degrees of freedom equal to about 3n/4. You can also use something like 10-fold cross-validation repeated 20 times to get a fairly precise and unbiased estimate of R^2.
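Something like the following would carry out the repeated cross-validation (a rough sketch only, assuming a data frame dat with the outcome in a column y and an rpart tree; the object names are illustrative, not from Jeff's setup):

library(rpart)

set.seed(1)
n    <- nrow(dat)
reps <- 20
k    <- 10
r2   <- numeric(reps)

for (r in seq_len(reps)) {
  fold <- sample(rep(seq_len(k), length.out = n))   # random fold labels
  pred <- numeric(n)
  for (f in seq_len(k)) {
    fit <- rpart(y ~ ., data = dat[fold != f, ], method = "anova")
    pred[fold == f] <- predict(fit, newdata = dat[fold == f, ])
  }
  # out-of-sample R^2 for this repeat: 1 - SSE/SST
  r2[r] <- 1 - sum((dat$y - pred)^2) / sum((dat$y - mean(dat$y))^2)
}
mean(r2)    # cross-validated estimate of R^2

# Ye's result suggests comparing this with an adjusted R^2 of the form
#   1 - (1 - R^2) * (n - 1) / (n - p - 1),  with effective df p of roughly 3n/4,
# where R^2 is the apparent (resubstitution) R^2 of the tree.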

Frank

@ARTICLE{ye98mea,
  author  = {Ye, Jianming},
  year    = 1998,
  title   = {On measuring and correcting the effects of data mining and model selection},
  journal = {Journal of the American Statistical Association},
  volume  = 93,
  pages   = {120-131},
  annote  = {generalized degrees of freedom; GDF; effective degrees of freedom;
             data mining; model selection; model uncertainty; overfitting;
             nonparametric regression; CART; simulation setup}
}
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

______________________________________________
R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.