Re: [R] regression tree xerror

From: Luis Torgo <ltorgo_at_liacc.up.pt>
Date: Wed 30 Mar 2005 - 04:13:23 EST

Sherri Miller wrote:

>I am running some models (for the first time) using rpart and am getting
>results I don't know how to interpret. I'm using cross-validation to prune
>the tree and the results look like:
>Root node error: 172.71/292 = 0.59148
>
>n= 292
>
>         CP nsplit rel error  xerror     xstd
>1  0.124662      0   1.00000 1.00731 0.093701
>2  0.064634      1   0.87534 1.08076 0.092337
>3  0.057300      2   0.81070 1.07684 0.095582
>4  0.038462      4   0.69610 0.99104 0.091659
>5  0.036200      5   0.65764 1.01596 0.094635
>6  0.029228      7   0.58524 1.00058 0.095440
>7  0.028779      8   0.55601 1.00704 0.093242
>8  0.024192      9   0.52724 0.97844 0.088936
>9  0.018038     11   0.47885 1.02749 0.092263
>10 0.016867     13   0.44278 1.08704 0.092112
>11 0.015465     14   0.42591 1.10805 0.097813
>12 0.015000     15   0.41044 1.11130 0.097881
>
>I do not understand why the rel error rate is going down, but the xerror
>generally goes up. For some of the runs, the xerror never goes down. Is
>this result caused by something in my data structure? I have run some example
>datasets from the various help manuals and the xerror goes down, as one
>would expect. Any suggestions?
>
>Sherri
>
>Sherri L. Miller
>Wildlife Biologist
>Redwood Sciences Laboratory
>707.825.2949
>707.825.2901 (FAX)
>
>______________________________________________
>R-help@stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
>
rel error is computed on the training data (the sample used to grow the tree), so it decreases as the tree grows, because the tree fits that sample more and more closely. This apparent improvement should not be taken at face value when predicting for new data: larger trees tend to overfit the training sample and generalise poorly to fresh data samples.
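To make the contrast concrete, here is a small base-R sketch that reads the two columns straight from the table quoted above. The data frame name cp_tab and the use of the "1-SE rule" (take the smallest tree whose xerror is within one xstd of the minimum) are assumptions made for illustration; the 1-SE rule is a common convention, not something rpart applies for you.

```r
# Values copied from the printcp output quoted in the question
cp_tab <- data.frame(
  CP = c(0.124662, 0.064634, 0.057300, 0.038462, 0.036200, 0.029228,
         0.028779, 0.024192, 0.018038, 0.016867, 0.015465, 0.015000),
  nsplit = c(0, 1, 2, 4, 5, 7, 8, 9, 11, 13, 14, 15),
  rel_error = c(1.00000, 0.87534, 0.81070, 0.69610, 0.65764, 0.58524,
                0.55601, 0.52724, 0.47885, 0.44278, 0.42591, 0.41044),
  xerror = c(1.00731, 1.08076, 1.07684, 0.99104, 1.01596, 1.00058,
             1.00704, 0.97844, 1.02749, 1.08704, 1.10805, 1.11130),
  xstd = c(0.093701, 0.092337, 0.095582, 0.091659, 0.094635, 0.095440,
           0.093242, 0.088936, 0.092263, 0.092112, 0.097813, 0.097881)
)

# Training (resubstitution) error falls with every additional split
train_always_decreases <- all(diff(cp_tab$rel_error) < 0)

# Row with the smallest cross-validated error
best <- which.min(cp_tab$xerror)

# 1-SE rule: first (smallest) tree whose xerror is within one xstd
# of the minimum xerror
threshold <- cp_tab$xerror[best] + cp_tab$xstd[best]
one_se <- which(cp_tab$xerror <= threshold)[1]

cat("rel error strictly decreasing:", train_always_decreases, "\n")
cat("minimum xerror at nsplit =", cp_tab$nsplit[best], "\n")
cat("1-SE choice at nsplit =", cp_tab$nsplit[one_se], "\n")
```

For this table the minimum xerror is at 9 splits, but the root (0 splits) is already within one xstd of it. Read that way, the cross-validation gives no clear evidence that any of these trees predicts better than the root node alone.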

That is the motivation for the xerror (and xstd) estimates, which give a more realistic picture of how the tree will perform on new samples of data. rpart() obtains them through an internal cross-validation process (10-fold by default; see xval in rpart.control). If the xerror estimates suggest you would be better off with a smaller tree, use prune() to select that subtree of the tree grown by rpart().
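A minimal sketch of that workflow follows. The car.test.frame data set (shipped with rpart) only stands in for your own data, and the formula, the cp = 0.001 setting, and the choice of the minimum-xerror row are illustrative assumptions, not a recommendation for your model.

```r
library(rpart)

set.seed(1)  # the cross-validation folds are random, so fix the seed

# Grow a deliberately large regression tree on stand-in data
fit <- rpart(Mileage ~ Weight + Disp. + HP, data = car.test.frame,
             method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

printcp(fit)  # CP / nsplit / rel error / xerror / xstd table

# Pick the CP value of the row with the smallest cross-validated error
cpt <- fit$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]

# Cut the tree back to the subtree corresponding to that CP value
pruned <- prune(fit, cp = best_cp)
print(pruned)
```

Because the folds are drawn at random, xerror changes from run to run unless the seed is fixed, so with a small sample it is worth repeating the fit a few times before trusting any one row of the table.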

Hope this helps.

Luis Torgo

-- 
Luis Torgo
    FEP/LIACC, University of Porto   Phone : (+351) 22 339 20 93
    Machine Learning Group           Fax   : (+351) 22 339 20 99
    R. de Ceuta, 118, 6o             email : ltorgo@liacc.up.pt
    4050-190 PORTO - PORTUGAL        WWW   : http://www.liacc.up.pt/~ltorgo

Received on Wed Mar 30 04:40:52 2005
