Re: [R] User defined split function in Rpart

From: R Help <rhelp.stats_at_gmail.com>
Date: Wed, 13 Feb 2008 12:15:40 -0400

Direction is indexed the same way as goodness: for the split scored by goodness[i], direction[i] = -1 means that values less than that split point go left and values greater than it go right. If direction[i] = 1, the two sides are swapped.

The long and short of it is that, for most trees, we want to send values smaller than the split value left and larger values right, so direction should be -1 for all splits, i.e., direction = rep(-1, length(goodness)). The vector is only there in case you want to customize the structure of your tree.
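
A minimal sketch of the point above, assuming unit weights and the data from the quoted message. It repeats the arithmetic of the continuous branch of temp2 and compares the example's sign(lmean) with the constant vector; the names yc, best, dir_example and dir_simple are made up for this illustration.

```r
# Data from the original post (x is already sorted).
y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
wt <- rep(1, length(y))

# Same arithmetic as the continuous branch of temp2 in the quoted message.
n <- length(y)
yc <- y - sum(y * wt) / sum(wt)          # centered response
temp <- cumsum(yc * wt)[-n]
left.wt <- cumsum(wt)[-n]
right.wt <- sum(wt) - left.wt
lmean <- temp / left.wt
rmean <- -temp / right.wt
goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * yc^2)

best <- which.max(goodness)              # index of the best-scoring split
dir_example <- sign(lmean)               # what the example file returns
dir_simple <- rep(-1, length(goodness))  # always send smaller x values left

best
dir_example[best]
dir_simple[best]
```

As far as I can tell, only the entry at the chosen split affects the fitted tree, and here both vectors hold -1 at that position, so either choice should send the first 19 observations left. Note also that the first 10 responses sum to exactly 10 times the overall mean, so lmean[10] is zero in exact arithmetic and its printed sign probably reflects rounding error.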

Hope that helps,
Sam

On Jan 3, 2007 12:56 PM, Paolo Radaelli <paolo.radaelli_at_unimib.it> wrote:
> Dear all,
> I'm trying to work with user-defined split functions in rpart
> (file rpart\tests\usersplits.R in
> http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz - see bottom of
> the email).
> Suppose we have the following data.frame (note that the x values are already
> sorted):
> > D
>    y     x
> 1  7 0.428
> 2  3 0.876
> 3  1 1.467
> 4  6 1.492
> 5  3 1.703
> 6  4 2.406
> 7  8 2.628
> 8  6 2.879
> 9  5 3.025
> 10 3 3.494
> 11 2 3.496
> 12 6 4.623
> 13 4 4.824
> 14 6 4.847
> 15 2 6.234
> 16 7 7.041
> 17 2 8.600
> 18 4 9.225
> 19 5 9.381
> 20 8 9.986
>
> Running rpart and setting minbucket=1 and maxdepth=1 we get the following
> tree (which uses, by default, deviance):
> > rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
> n= 20
>
> node), split, n, deviance, yval
>       * denotes terminal node
>
> 1) root 20 84.80000 4.600000
>   2) D$x< 9.6835 19 72.63158 4.421053 *
>   3) D$x>=9.6835 1 0.00000 8.000000 *
>
> This means that the first 19 observations have been sent to the left side of
> the tree and one observation to the right.
> This agrees with what we observe in goodness (its maximum is the last
> element of the vector).
>
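
A minimal base-R check of the numbers in the output above, assuming the deviance is the sum of squared deviations from the node mean and the cutpoint is the midpoint between adjacent x values. The names dev and cutpoint are made up for this check and are not rpart functions.

```r
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)

# Deviance of a node: sum of squared deviations from the node mean.
dev <- function(v) sum((v - mean(v))^2)

cutpoint <- (x[19] + x[20]) / 2   # midpoint between observations 19 and 20
left <- y[x < cutpoint]           # observations sent to node 2
right <- y[x >= cutpoint]         # observations sent to node 3

c(root = dev(y), left = dev(left), right = dev(right))
c(n_left = length(left), n_right = length(right), cutpoint = cutpoint)
```

Under these assumptions the values match the printed tree: 84.8 at the root, about 72.63 on the left, 0 on the right, and a cutpoint of 9.6835.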
> The thing I really don't understand is the direction vector.
> # direction= -1 = send "y< cutpoint" to the left side of the tree
> # 1 = send "y< cutpoint" to the right
>
> What does it mean?
> In the example here considered we have
> > sign(lmean)
> [1] 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1
>
> What criterion is used?
> In my opinion all the values should be equal to -1, given that they have
> to be sent to the left side of the tree.
> Can someone help me?
> Thank you
>
> #######################################################
> # The split function, where most of the work occurs.
> #   Called once per split variable per node.
> # If continuous=T (the case here considered)
> #   The actual x variable is ordered
> #   y is supplied in the sort order of x, with no missings,
> #   return two vectors of length (n-1):
> #      goodness = goodness of the split, larger numbers are better.
> #                 0 = couldn't find any worthwhile split
> #        the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
> #      direction= -1 = send "y< cutpoint" to the left side of the tree
> #                  1 = send "y< cutpoint" to the right
> #         this is not a big deal, but making larger "mean y's" move towards
> #         the right of the tree, as we do here, seems to make it easier to
> #         read
> # If continuous=F, x is a set of integers defining the groups for an
> #   unordered predictor. In this case:
> #      direction = a vector of length m= "# groups". It asserts that the
> #         best split can be found by lining the groups up in this order
> #         and going from left to right, so that only m-1 splits need to
> #         be evaluated rather than 2^(m-1)
> #      goodness = m-1 values, as before.
> #
> # The reason for returning a vector of goodness is that the C routine
> # enforces the "minbucket" constraint. It selects the best return value
> # that is not too close to an edge.
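
The last comment, about minbucket, can be sketched in R. This is only an illustration of the idea and not the actual C routine: pick_split is a hypothetical helper, and it ignores details such as tied x values.

```r
# Hypothetical helper: choose the best split index subject to minbucket.
# goodness[i] scores splitting observations 1:i versus (i+1):n.
pick_split <- function(goodness, minbucket) {
  n <- length(goodness) + 1             # number of observations in the node
  if (n < 2 * minbucket) {
    return(NA_integer_)                 # no split leaves enough on both sides
  }
  ok <- seq(minbucket, n - minbucket)   # splits with >= minbucket obs per side
  ok[which.max(goodness[ok])]
}

g <- c(0.9, 0.1, 0.5, 0.2, 0.8)         # made-up goodness values for n = 6
pick_split(g, minbucket = 1)
pick_split(g, minbucket = 2)
pick_split(g, minbucket = 4)
```

With minbucket = 1 every split is allowed, so the overall maximum wins. With minbucket = 2 the two edge splits are excluded and the best interior value is chosen instead.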
> The vector wt of weights in our case is:
> > wt
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> temp2 <- function(y, wt, x, parms, continuous) {
>     # Center y
>     n <- length(y)
>     y <- y - sum(y*wt)/sum(wt)
>     if (continuous) {
>         # continuous x variable
>         temp <- cumsum(y*wt)[-n]
>         left.wt <- cumsum(wt)[-n]
>         right.wt <- sum(wt) - left.wt
>         lmean <- temp/left.wt
>         rmean <- -temp/right.wt
>         goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
>         list(goodness= goodness, direction=sign(lmean))
>     }
> }
>
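
The function above covers only the continuous case. A rough sketch of the unordered case described in the comments, assuming the groups are lined up by their mean of the centered y. The function name split_categorical and the toy data are made up here, and this is not claimed to be the code shipped with rpart.

```r
# Hypothetical stand-alone version of the unordered (continuous = FALSE) case.
split_categorical <- function(y, wt, x) {
  y <- y - sum(y * wt) / sum(wt)   # center y, as in temp2
  ux <- sort(unique(x))            # group labels
  wtsum <- tapply(wt, x, sum)      # total weight per group
  ysum <- tapply(y * wt, x, sum)   # weighted sum of centered y per group
  ord <- order(ysum / wtsum)       # line the groups up by their mean
  m <- length(ord)
  temp <- cumsum(ysum[ord])[-m]
  left.wt <- cumsum(wtsum[ord])[-m]
  right.wt <- sum(wt) - left.wt
  lmean <- temp / left.wt
  rmean <- -temp / right.wt
  goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
  # m - 1 goodness values, plus the group ordering as direction
  list(goodness = as.vector(goodness), direction = ux[ord])
}

res <- split_categorical(y = c(5, 6, 1, 2, 9, 10), wt = rep(1, 6),
                         x = c(1, 1, 2, 2, 3, 3))
res$direction
length(res$goodness)
```

In this toy example the group means are ordered as group 2, group 1, group 3, so direction carries that ordering and only two candidate splits are scored.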
> Paolo Radaelli
> Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
> Facoltà di Economia
> Università degli Studi di Milano-Bicocca
> P.zza dell'Ateneo Nuovo, 1
> 20126 Milano
> Italy
> e-mail paolo.radaelli_at_unimib.it
>
> ______________________________________________
> R-help_at_stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Wed 13 Feb 2008 - 16:20:59 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 13 Feb 2008 - 16:30:13 GMT.
