From: <apjaworski_at_mmm.com>

Date: Thu, 06 Mar 2008 15:08:43 -0600

Andy Jaworski

518-1-01

Process Laboratory

3M Corporate Research Laboratory

E-mail: apjaworski_at_mmm.com

Tel: (651) 733-6092

Fax: (651) 736-3122

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help

PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Thu 06 Mar 2008 - 21:16:42 GMT

Let me first explain why I need this and then give some details of what I have found out so far.

The rpart function in R accepts case weights, which seems to allow a rather simple implementation of bagging. In fact, Everitt and Hothorn describe such a procedure in chapter 8 of "A Handbook of Statistical Analyses Using R". Bagging consists of generating several samples with replacement from the original data set, which has N rows. The implementation described in the book first fits a non-pruned tree to the original data set. It then generates several (say, 25) multinomial samples of size N with probabilities 1/N. Each sample is used in turn as the weight vector to update the original tree fit. Finally, all the updated trees are combined to produce "consensus" class predictions.
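A minimal sketch of the weighted bagging procedure described above, not the book's actual code. The iris data set, the formula, the control settings and the majority-vote step are all assumptions made for illustration; in particular, the default complexity parameter is kept, so the trees are not strictly non-pruned.

```r
library(rpart)

set.seed(1)
n <- nrow(iris)   # N rows in the original data set (iris is a stand-in)
n_trees <- 25     # number of bootstrap replicates

# Each column is one multinomial sample of size N with probabilities 1/N,
# i.e. the number of times each row would appear in a bootstrap sample.
w <- rmultinom(n_trees, size = n, prob = rep(1 / n, n))

# Assumption: cross-validation is switched off, other controls are defaults.
ctrl <- rpart.control(xval = 0)

# Fit one tree per weight vector on the original rows.
trees <- lapply(seq_len(n_trees), function(b) {
  rpart(Species ~ ., data = iris, weights = w[, b], control = ctrl)
})

# Consensus prediction by majority vote over the trees (assumed rule).
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})
consensus <- apply(votes, 1, function(v) names(which.max(table(v))))
```

Each column of `w` sums to N, so every tree sees the same total weight as the original data set.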

Now, a typical realization of a multinomial sample consists of small integers and several zeros. I thought that weighting worked like this: observations with weight 0 are omitted, and observations with weight greater than 1 are essentially replicated according to the weight. So, instead of running rpart with weights starting with, say, (1, 0, 2, 0, 1, ...), I thought I could simply build a sample data set by retaining row 1, omitting row 2, including row 3 twice, omitting row 4, retaining row 5, and so on. However, this does not work as I expected. Running weighted rpart on the original data set and running unweighted rpart on the sample data set described above does not give identical trees. The trees are completely different: the threshold values differ, and the variables enter the splits in a different order. Moreover, the predictions from these trees can differ, so the misclassification rates usually differ as well.
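The comparison described above can be sketched as follows. The iris data set and the control settings are assumptions for illustration. The comments about node counts are a guess at one possible contributor to the discrepancy, not a confirmed explanation: in the weighted fit the zero-weight rows still appear to be counted in the per-node observation counts, which settings such as minsplit and minbucket are documented in terms of.

```r
library(rpart)

set.seed(42)
n <- nrow(iris)

# One multinomial weight vector: small integers and several zeros.
w <- as.vector(rmultinom(1, size = n, prob = rep(1 / n, n)))

ctrl <- rpart.control(xval = 0)

# (a) Weighted fit on the original rows.
fit_w <- rpart(Species ~ ., data = iris, weights = w, control = ctrl)

# (b) Unweighted fit on a data set in which row i appears w[i] times.
idx  <- rep(seq_len(n), times = w)
boot <- iris[idx, ]
fit_r <- rpart(Species ~ ., data = boot, control = ctrl)

# Compare the split variables, row counts (n) and summed weights (wt) per node.
# In fit_r the two columns agree; in fit_w they may differ below the root,
# because n counts rows while wt sums the case weights.
print(fit_w$frame[, c("var", "n", "wt")])
print(fit_r$frame[, c("var", "n", "wt")])

# Fraction of original rows on which the two trees predict different classes.
pred_w <- predict(fit_w, newdata = iris, type = "class")
pred_r <- predict(fit_r, newdata = iris, type = "class")
disagreement <- mean(pred_w != pred_r)
```

Printing the two trees side by side, and repeating with different seeds, shows how often the structures and predictions actually diverge for a given data set.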

This finally brings me to my question: is there a way to mimic the effect of weighting in rpart, for example by modifying the data set or by some other means?

Thanks in advance for your time,

Andy
