Re: [R] rpart

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Tue 26 Sep 2006 - 11:54:22 GMT

On Tue, 26 Sep 2006, henrigel@gmx.de wrote:

>
> -------- Original Message --------
> Date: Tue, 26 Sep 2006 09:56:53 +0100 (BST)
> From: Prof Brian Ripley <ripley@stats.ox.ac.uk>
> To: henrigel@gmx.de
> Subject: Re: [R] rpart
>
>> On Mon, 25 Sep 2006, henrigel@gmx.de wrote:
>>
>>> Dear r-help-list:
>>>
>>> If I use the rpart method like
>>>
>>> cfit<-rpart(y~.,data=data,...),
>>>
>>> what kind of tree is stored in cfit?
>>> Is it right that this tree is not pruned at all, that it is the full
>>> tree?
>>
>> It is an rpart object. This contains both the tree and the instructions
>> for pruning it at all values of cp: note that cp is also used in deciding
>> how large a tree to grow.
>>
>
> OK, I have to explain my problem in a little more detail. I'm sorry for being so vague.
> I used the method in the following way:
> cfit<- rpart(y~., method="class", minsplit=1, cp=0)
> I got a tree with a lot of terminal nodes that contained more than 100 observations. This made me believe that the tree had already been pruned.
> On the other hand, the printcp method showed subtrees that were "better".
> This made me believe that the tree had not been pruned before.
> So, are the trees "a little bit" pruned?

Yes, as you asked for cp=0. Look up what that does in ?rpart.control.
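
For reference, a minimal sketch of growing a tree with cp = 0 and then choosing a subtree from the cp table. The kyphosis data (shipped with rpart) stands in for the poster's data, and minsplit = 2 with minbucket = 1 are assumed settings picked for illustration, not the call quoted above.

```r
library(rpart)

# kyphosis ships with rpart; it is a stand-in for the poster's data
set.seed(1)  # xerror comes from cross-validation, so it depends on the seed

cfit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              method = "class",
              control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))

printcp(cfit)            # one row per nested subtree
cp_tab <- cfit$cptable   # the same table as a matrix

# Prune back to the subtree with the smallest cross-validated error
best   <- which.min(cp_tab[, "xerror"])
pruned <- prune(cfit, cp = cp_tab[best, "CP"])
```

plotcp(cfit) draws the same table, with a horizontal line one standard error above the minimum of the xerror curve.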

>>> If so, it's up to me to choose a subtree by using the printcp method.
>>
>> Or the plotcp method.
>>
>>> In the technical report from Atkinson and Therneau "An Introduction to
>>> recursive partitioning using the rpart routines" from 2000, one can see
>>> the following table on page 15:
>>>
>>>       CP  nsplit  rel error  xerror   xstd
>>> 1  0.105       0    1.00000  1.0000  0.108
>>> 2  0.056       3    0.68519  1.1852  0.111
>>> 3  0.028       4    0.62963  1.0556  0.109
>>> 4  0.574       6    0.57407  1.0556  0.109
>>> 5  0.100       7    0.55556  1.0556  0.109
>>>
>>> Some lines below it says "We see that the best tree has 5 terminal nodes
>>> (4 splits)". Why is that, if the xerror is lowest for the tree consisting
>>> only of the root?
>>
>> There are *two* reports with that name: this seems to be from minitech.ps.
>> The choice is explained in the rest of that para (the 1-SE rule was used).
>> My guess is that the authors excluded the root as not being a tree, but
>> only they can answer that.
>>
>
> Are both reports from 2000? But you're right, I'm talking about the one from minitech.ps.
> The 1-SE rule only explains why they didn't choose the tree with 6 or 7 splits, but not why they didn't choose the "tree" without a split.
> The exclusion of the root as not being a tree was my first explanation, too. But if the tree consisting only of the root is still better than any other tree, why would I choose a tree with 4 splits?
>
> Henri
>
>
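
A minimal sketch of the 1-SE rule applied to the xerror and xstd columns of the table quoted above. The helper name one_se is hypothetical, and the rule is coded as commonly stated: take the smallest tree whose xerror is within one standard error of the minimum. Whether the report's authors dropped the root row is only a guess.

```r
# nsplit, xerror and xstd copied from the table quoted above
tab <- data.frame(
  nsplit = c(0, 3, 4, 6, 7),
  xerror = c(1.0000, 1.1852, 1.0556, 1.0556, 1.0556),
  xstd   = c(0.108, 0.111, 0.109, 0.109, 0.109)
)

# Hypothetical helper: number of splits chosen by the 1-SE rule
one_se <- function(tab) {
  i_min     <- which.min(tab$xerror)               # first row with minimal xerror
  threshold <- tab$xerror[i_min] + tab$xstd[i_min]
  tab$nsplit[min(which(tab$xerror <= threshold))]  # smallest qualifying tree
}

one_se(tab)         # all rows, root included: 0 splits
one_se(tab[-1, ])   # root row dropped: 4 splits
```

With the root row included the rule returns the root itself. With it dropped, the threshold is 1.0556 + 0.109 = 1.1646, which rules out the 3-split tree (1.1852) and leaves the 4-split tree as the smallest candidate. That matches the report's choice, but it does not show that this was the authors' reasoning.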

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue Sep 26 22:02:03 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 26 Sep 2006 - 16:31:00 GMT.