Date: Wed, 13 Feb 2008 12:15:40 -0400

Direction corresponds to goodness element by element: for the split represented by goodness[i], direction[i] = -1 means that values less than the cutpoint of that split go left and values greater than it go right. If direction[i] = 1, the two sides are swapped.
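To make the convention concrete, here is a small sketch. The helper route() is hypothetical, written only for illustration; it is not part of rpart and it mimics only the rule described above, not what the C code does internally.

```r
# Hypothetical helper (not an rpart function): report which side of the
# tree each x value is sent to, given a cutpoint and a direction of -1 or 1.
route <- function(x, cutpoint, direction) {
  below <- x < cutpoint
  if (direction == -1) {
    ifelse(below, "left", "right")   # smaller values go left
  } else {
    ifelse(below, "right", "left")   # sides swapped
  }
}

x <- c(0.428, 9.381, 9.986)
route(x, 9.6835, -1)   # "left"  "left"  "right"
route(x, 9.6835, 1)    # "right" "right" "left"
```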

The long and short of it is that, for most trees, we want to send values smaller than the split value left and larger values right, so direction should be -1 for all splits, i.e., direction = rep(-1, length(goodness)). The vector is only there so that you can customize the structure of your tree.
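If it helps, here is a minimal sketch of the continuous case with the direction fixed at -1. It reuses the arithmetic of the split function quoted below; the name split_sketch and the fixed argument are my own additions for illustration, and the non-continuous case is left out.

```r
# Data from the original post (x is already sorted, so only y is needed).
y  <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
wt <- rep(1, length(y))

# Hypothetical variant of the quoted split function, continuous case only.
# fixed = TRUE  returns direction = -1 for every split;
# fixed = FALSE returns sign(lmean), as in the quoted code.
split_sketch <- function(y, wt, fixed = TRUE) {
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)            # center y
  temp     <- cumsum(y * wt)[-n]
  left.wt  <- cumsum(wt)[-n]
  right.wt <- sum(wt) - left.wt
  lmean    <- temp / left.wt                # left mean of centered y
  rmean    <- -temp / right.wt              # right mean of centered y
  goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
  direction <- if (fixed) rep(-1, length(goodness)) else sign(lmean)
  list(goodness = goodness, direction = direction)
}

res <- split_sketch(y, wt)
which.max(res$goodness)   # 19
res$direction

orig <- split_sketch(y, wt, fixed = FALSE)
orig$direction
```

With the data from your post, the best split is the 19th one, which matches the rpart output. Setting fixed = FALSE should reproduce the sign(lmean) vector you saw: those signs just record whether the mean of y to the left of the cutpoint is below (-1) or above (1) the overall mean. The tenth element sits at essentially zero, so its sign comes down to rounding.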

Hope that helps,

Sam

On Jan 3, 2007 12:56 PM, Paolo Radaelli <paolo.radaelli_at_unimib.it> wrote:

> Dear all,

> I'm trying to work with user-defined split functions in rpart
> (file rpart\tests\usersplits.R in
> http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz - see bottom of
> the email).
> Suppose we have the following data.frame (note that x's values are already
> sorted):
> > D
>    y     x
> 1  7 0.428
> 2  3 0.876
> 3  1 1.467
> 4  6 1.492
> 5  3 1.703
> 6  4 2.406
> 7  8 2.628
> 8  6 2.879
> 9  5 3.025
> 10 3 3.494
> 11 2 3.496
> 12 6 4.623
> 13 4 4.824
> 14 6 4.847
> 15 2 6.234
> 16 7 7.041
> 17 2 8.600
> 18 4 9.225
> 19 5 9.381
> 20 8 9.986
>
> Running rpart and setting minbucket=1 and maxdepth=1 we get the following
> tree (which uses, by default, deviance):
> > rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
> n= 20
>
> node), split, n, deviance, yval
>       * denotes terminal node
>
> 1) root 20 84.80000 4.600000
>   2) D$x< 9.6835 19 72.63158 4.421053 *
>   3) D$x>=9.6835 1 0.00000 8.000000 *
>
> This means that the first 19 observations have been sent to the left side of
> the tree and one observation to the right.
> This agrees with goodness (the maximum is the last element of
> the vector).
>
> The thing I really don't understand is the direction vector.
> # direction= -1 = send "y< cutpoint" to the left side of the tree
> #             1 = send "y< cutpoint" to the right
>
> What does it mean?
> In the example considered here we have
> > sign(lmean)
> [1] 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1
>
> Which criterion is used?
> In my opinion all the values should be equal to -1, given that they have
> to be sent to the left side of the tree.
> Can someone help me?
> Thank you
>
> #######################################################
> # The split function, where most of the work occurs.
> #   Called once per split variable per node.
> # If continuous=T (the case here considered)
> #   The actual x variable is ordered
> #   y is supplied in the sort order of x, with no missings,
> #   return two vectors of length (n-1):
> #     goodness = goodness of the split, larger numbers are better.
> #                0 = couldn't find any worthwhile split
> #       the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
> #     direction= -1 = send "y< cutpoint" to the left side of the tree
> #                 1 = send "y< cutpoint" to the right
> #       this is not a big deal, but making larger "mean y's" move towards
> #       the right of the tree, as we do here, seems to make it easier to
> #       read
> # If continuous=F, x is a set of integers defining the groups for an
> #   unordered predictor. In this case:
> #     direction = a vector of length m= "# groups". It asserts that the
> #       best split can be found by lining the groups up in this order
> #       and going from left to right, so that only m-1 splits need to
> #       be evaluated rather than 2^(m-1)
> #     goodness = m-1 values, as before.
> #
> # The reason for returning a vector of goodness is that the C routine
> #   enforces the "minbucket" constraint. It selects the best return value
> #   that is not too close to an edge.
> The vector wt of weights in our case is:
> > wt
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> temp2 <- function(y, wt, x, parms, continuous) {
>     # Center y
>     n <- length(y)
>     y <- y - sum(y*wt)/sum(wt)
>     if (continuous) {
>         # continuous x variable
>         temp <- cumsum(y*wt)[-n]
>         left.wt <- cumsum(wt)[-n]
>         right.wt <- sum(wt) - left.wt
>         lmean <- temp/left.wt
>         rmean <- -temp/right.wt
>         goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
>         list(goodness= goodness, direction=sign(lmean))
>     }
> }
>
> Paolo Radaelli
> Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
> Facoltà di Economia
> Università degli Studi di Milano-Bicocca
> P.zza dell'Ateneo Nuovo, 1
> 20126 Milano
> Italy
> e-mail paolo.radaelli_at_unimib.it