Date: Wed 05 May 2004 - 08:59:44 EST
I am wondering about the best way to control for input variables that have a
large number of levels in 'rpart' models. I understand the algorithm
searches through all possible binary splits (2^(k-1) - 1 for k levels), and so
variables with more levels are more prone to be picked as good splitters... so I'm
looking for ways to compensate and adjust for this complexity.
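To put numbers on that, here is a quick sketch of the count I have in mind. It assumes each candidate split is a partition of the k levels into two non-empty groups, which gives 2^(k-1) - 1 candidates; n_splits is just a helper name I made up for illustration.

```r
# Number of ways to divide k unordered factor levels into two
# non-empty groups (left branch / right branch of a binary split).
# n_splits is a hypothetical helper, not part of rpart.
n_splits <- function(k) 2^(k - 1) - 1

n_splits(2)    # 1 candidate split for a 2-level factor
n_splits(13)   # 4095 candidate splits for a 13-level factor
```

So the 13-level variable gets thousands of chances to find a good-looking split where the 2-level variable gets exactly one.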
For example, if two variables produce comparable splits in the data but
one contains 2 levels and the other 13 levels, then I would like the
algorithm to choose the 'simpler' split.
Is this best done with the 'cost' argument in the rpart options? This
defaults to one for all variables... so would it make sense to scale
this by nlevels in each variable or sqrt(nlevels) or something similar?
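Roughly, this is what I have in mind. It is only a sketch on made-up data (y, x2 and x13 are invented names), and the sqrt(nlevels) scaling is just my guess at a penalty, not something I have seen recommended. As I read the help page, the improvement for a split on a variable is divided by that variable's cost, and the cost vector needs one entry per predictor, in model order.

```r
library(rpart)

# Made-up data purely for illustration: a two-class response,
# one 2-level factor and one 13-level factor.
set.seed(1)
n <- 200
d <- data.frame(
  y   = factor(sample(c("a", "b"), n, replace = TRUE), levels = c("a", "b")),
  x2  = factor(sample(c("u", "v"), n, replace = TRUE), levels = c("u", "v")),
  x13 = factor(sample(letters[1:13], n, replace = TRUE), levels = letters[1:13])
)

# Predictors in the same order they will appear in the formula.
vars <- c("x2", "x13")

# Assumed penalty: cost grows with sqrt(number of levels), so the
# 13-level factor must show a larger raw improvement to be chosen.
costs <- unname(sqrt(sapply(d[vars], nlevels)))

# Build y ~ x2 + x13 from vars so costs and predictors stay aligned.
fit <- rpart(reformulate(vars, response = "y"), data = d,
             method = "class", cost = costs)
```

Replacing sqrt with the identity would give the plain nlevels scaling instead; I don't know which (if either) is the right amount of adjustment.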