Re: [R] Random forests

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Wed, 19 Dec 2007 09:39:17 +0000

On Tue, 2007-12-18 at 16:27 -0600, Naiara Pinto wrote:
> Dear all,
>
> I would like to use a tree regression method to analyze my dataset. I
> am interested in the fact that random forests creates in-bag and
> out-of-bag datasets, but I also need an estimate of support for each
> split. That seems hard to do in random forests since each tree is
> grown using a subset of the predictor variables.
>
> I was thinking of setting mtry = number of predictor variables,
> growing several trees, and computing the support for each node as the
> number of times that a certain predictor variable was chosen for that
> node. Can this be implemented using random forests?

Hi Naiara,

I'm not an expert here, but what you propose with mtry = number of predictors will give you a procedure known as bagging.
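A minimal sketch of what that looks like with the randomForest package (here 'dat' is a placeholder data frame with response 'y' and the rest of the columns as predictors):

```r
library(randomForest)

## number of predictor variables (all columns except the response)
p <- ncol(dat) - 1

## setting mtry equal to the number of predictors means every split
## considers all predictors, so this is bagging, not a random forest
bag <- randomForest(y ~ ., data = dat, mtry = p, ntree = 500)
```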

You talk about support for the split and then for the node. Is that just a typo, or are you interested in two different things?

I'm not aware of how you do the latter in bagging or random forests, as the whole point is to grow large trees, not pruned ones. As to the former, trees are unstable: change the data used to train them just a little and you can get a very different fitted tree.

Bagging and random forests exploit this to produce a better prediction machine / classifier by using n poor trees rather than one best tree. They do this by adding randomness to the procedure: bootstrap sampling the training data, and, in the case of random forests, randomly sampling a small number, mtry, of the available predictors at each split. As such there is no correspondence between the splits of one tree and the splits of another, so counting how many times a certain split is formed by the same predictor across trees is not well defined. So it doesn't make sense (to me; it may to others) to focus on individual splits in the n trees.

I don't know what you mean exactly by "support", but if you are trying to get a measure of how important each of your predictors is in explaining variance in your response, then take a look at the importance() function in the randomForest package. This produces a couple of measures that allow you to determine which predictors contribute most to reducing node impurity or MSE.
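A sketch of how you might get those measures (again 'dat' and 'y' are placeholders for your own data frame and response):

```r
library(randomForest)

## importance = TRUE requests the permutation-based importance
## measure in addition to the node-impurity / MSE-based one
rf <- randomForest(y ~ ., data = dat, importance = TRUE, ntree = 500)

importance(rf)   # %IncMSE and IncNodePurity for a regression forest
varImpPlot(rf)   # dotchart of the importance measures
```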

HTH G

>
> Thanks!
>
> Naiara.
>

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 19 Dec 2007 - 09:45:29 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 19 Dec 2007 - 10:30:20 GMT.
