From: Paolo Radaelli <paolo.radaelli_at_unimib.it>

Date: Wed 03 Jan 2007 - 16:56:18 GMT

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu Jan 04 04:02:46 2007

Dear all,

I'm trying to work with user-defined split functions in rpart
(file rpart\tests\usersplits.R in
http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz; see the bottom of
this email).

Suppose we have the following data.frame (note that the x values are already
sorted):

> D
   y     x
1  7 0.428
2  3 0.876
3  1 1.467
4  6 1.492
5  3 1.703
6  4 2.406
7  8 2.628
8  6 2.879
9  5 3.025
10 3 3.494
11 2 3.496
12 6 4.623
13 4 4.824
14 6 4.847
15 2 6.234
16 7 7.041
17 2 8.600
18 4 9.225
19 5 9.381
20 8 9.986

Running rpart with minbucket=1 and maxdepth=1 we get the following tree
(which uses, by default, deviance):

> rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
n= 20

node), split, n, deviance, yval
      * denotes terminal node

1) root 20 84.80000 4.600000
  2) D$x< 9.6835 19 72.63158 4.421053 *
  3) D$x>=9.6835 1 0.00000 8.000000 *

This means that the first 19 observations have been sent to the left side of the tree and one observation to the right. This agrees with the goodness vector, whose maximum is its last element.
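The numbers in the printed tree can be checked by hand in base R. The following is only a sketch: the helper dev() and the variable names are introduced here for illustration, and the midpoint rule for the cutpoint is an assumption that happens to match the reported value 9.6835.

```r
# Data from the post (x is already sorted)
D <- data.frame(
  y = c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8),
  x = c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
        3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
)

# Deviance of a node = sum of squares around the node mean (illustrative helper)
dev <- function(v) sum((v - mean(v))^2)

n <- nrow(D)

# Deviance reduction for each candidate split: obs 1:i vs (i+1):n
reduction <- sapply(1:(n - 1), function(i) {
  dev(D$y) - dev(D$y[1:i]) - dev(D$y[(i + 1):n])
})

best <- which.max(reduction)                    # index of the best candidate split
cutpoint <- (D$x[best] + D$x[best + 1]) / 2     # assumed: midpoint of adjacent x values

dev(D$y)            # root deviance
dev(D$y[1:best])    # deviance of the left node
best
cutpoint
```

With these data the best candidate is the last one (19 observations on the left, 1 on the right), which is the split shown in the tree.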

The thing I really don't understand is the direction vector:

#      direction= -1 = send "y< cutpoint" to the left side of the tree
#                  1 = send "y< cutpoint" to the right

What does this mean?

In the example considered here we have

> sign(lmean)

[1] 1 1 -1 -1 -1 -1 -1 1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Which criterion is used?

In my opinion all the values should be equal to -1, given that the
observations have to be sent to the left side of the tree.

Can someone help me?

Thank you

#######################################################
# The split function, where most of the work occurs.
#   Called once per split variable per node.
# If continuous=T (the case here considered)
#   The actual x variable is ordered
#   y is supplied in the sort order of x, with no missings,
#   return two vectors of length (n-1):
#      goodness = goodness of the split, larger numbers are better.
#                 0 = couldn't find any worthwhile split
#        the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
#      direction= -1 = send "y< cutpoint" to the left side of the tree
#                  1 = send "y< cutpoint" to the right
#         this is not a big deal, but making larger "mean y's" move towards
#         the right of the tree, as we do here, seems to make it easier to
#         read
# If continuous=F, x is a set of integers defining the groups for an
#   unordered predictor.  In this case:
#       direction = a vector of length m= "# groups".  It asserts that the
#         best split can be found by lining the groups up in this order
#         and going from left to right, so that only m-1 splits need to
#         be evaluated rather than 2^(m-1)
#       goodness = m-1 values, as before.
#
# The reason for returning a vector of goodness is that the C routine
#   enforces the "minbucket" constraint. It selects the best return value
#   that is not too close to an edge.
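The last comment says that the C routine picks the best goodness value that is not too close to an edge. A rough R sketch of that idea follows. The function pick_split() and the goodness values are hypothetical and made up for illustration; this is not the actual rpart C code, only one plausible reading of the comment.

```r
# Hypothetical helper (not part of rpart): choose the best candidate split
# that leaves at least `minbucket` observations on each side.
pick_split <- function(goodness, minbucket) {
  n <- length(goodness) + 1            # goodness has length n - 1
  i <- seq_len(n - 1)                  # candidate i splits obs 1:i vs (i+1):n
  ok <- i[i >= minbucket & (n - i) >= minbucket]
  # No admissible candidate, or nothing worthwhile (goodness 0 = no split)
  if (length(ok) == 0 || all(goodness[ok] <= 0)) return(NA_integer_)
  ok[which.max(goodness[ok])]
}

# Made-up goodness values for a node with n = 6 observations
g <- c(0.07, 0.01, 0.04, 0.02, 0.14)

pick_split(g, minbucket = 1)   # every candidate is admissible
pick_split(g, minbucket = 2)   # only candidates 2, 3 and 4 are admissible
pick_split(g, minbucket = 4)   # no candidate leaves 4 obs on both sides
```

Returning the whole goodness vector, rather than only its maximum, is what lets the caller apply such a constraint afterwards.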

The vector wt of weights in our case is:
> wt

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

temp2 <- function(y, wt, x, parms, continuous) {
    # Center y
    n <- length(y)
    y <- y- sum(y*wt)/sum(wt)

    if (continuous) {
        # continuous x variable
        temp <- cumsum(y*wt)[-n]

        left.wt  <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        list(goodness= goodness, direction=sign(lmean))
    }
}
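To see what the function returns for the data above, it can be called directly. This is a sketch under stated assumptions: the function is copied from the excerpt above (continuous branch only), parms is passed as NULL because this branch does not use it, and the names res and best are introduced here for illustration.

```r
# Split function as quoted in this message (continuous case only)
temp2 <- function(y, wt, x, parms, continuous) {
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)          # center y
  if (continuous) {
    temp     <- cumsum(y * wt)[-n]
    left.wt  <- cumsum(wt)[-n]
    right.wt <- sum(wt) - left.wt
    lmean    <- temp / left.wt            # centered mean of obs 1:i
    rmean    <- -temp / right.wt          # centered mean of obs (i+1):n
    goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
    list(goodness = goodness, direction = sign(lmean))
  }
}

y  <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x  <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
        3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
wt <- rep(1, length(y))

res <- temp2(y, wt, x, parms = NULL, continuous = TRUE)

length(res$goodness)             # n - 1 = 19 candidate splits
best <- which.max(res$goodness)  # candidate with the largest goodness
best
res$goodness[best]
res$direction[best]              # direction attached to that candidate
```

Both returned vectors have length n-1, so element i describes the candidate split "obs 1:i vs (i+1):n" rather than observation i. In this function direction[i] is simply the sign of the centered mean of the first i observations.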

Paolo Radaelli

Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
Facoltà di Economia

Università degli Studi di Milano-Bicocca
P.zza dell'Ateneo Nuovo, 1

20126 Milano

Italy

e-mail paolo.radaelli@unimib.it


