[R] randomForest

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Fri 08 Jul 2005 - 05:38:14 EST

Hi there:
I have a question about random forests:

Recently I helped a friend with her random forest and ran into this problem. Her dataset has 6 classes, the sample size is pretty small (264), and the class distribution is as follows (Diag is the response variable):
> sample.size <- lapply(1:6, function(i) sum(Diag==i))
> sample.size

[[1]]
[1] 36

[[2]]
[1] 12

[[3]]
[1] 120

[[4]]
[1] 36

[[5]]
[1] 30

[[6]]
[1] 30

I assigned this sample.size to sampsz for stratified sampling and got the following error:
Error in sum(..., na.rm = na.rm) : invalid 'mode' of argument

If I use sampsz=c(36, 12, 120, 36, 30, 30) instead, it works fine. Could you tell me why?
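
A likely explanation, offered as a sketch and not as a confirmed diagnosis: lapply() always returns a list, while c(36, 12, ...) is a plain numeric vector, and sum() refuses a list. The example below rebuilds a stand-in Diag from the class counts quoted above (the real data are not shown in this post) and compares the two forms. The names size.list, size.vec and size.tab are made up for illustration, and the idea that the package sums the sample sizes internally is inferred from the error message only.

```r
# Stand-in for the response variable, rebuilt from the counts in this post
Diag <- rep(1:6, times = c(36, 12, 120, 36, 30, 30))

# lapply() returns a list with one element per class
size.list <- lapply(1:6, function(i) sum(Diag == i))

# sapply() simplifies the same result to a plain integer vector
size.vec <- sapply(1:6, function(i) sum(Diag == i))

# table() counts every class in one call; as.vector() drops the names
size.tab <- as.vector(table(Diag))

# sum() stops with an error on a list but accepts the vector
sum.on.list <- try(sum(size.list), silent = TRUE)
list.fails <- inherits(sum.on.list, "try-error")
vec.total <- sum(size.vec)

# unlist() turns the existing list into a usable vector
size.fixed <- unlist(size.list)
```

Under these assumptions, any of size.vec, size.tab or size.fixed could be passed wherever c(36, 12, 120, 36, 30, 30) worked.
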
By the way, for a classification problem like this with uneven class sizes, do you have any suggestions for improving accuracy? I tried the c() form so that sampsz works, but the result is similar.
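
One commonly tried option for uneven classes, given here as a hedged sketch and not as a guaranteed fix: draw the same number of cases from every class for each tree, so that the largest class no longer dominates. The sketch assumes the randomForest package's documented sampsize and strata arguments. The predictors in x are random noise invented only so that the call runs, so the printed confusion matrix says nothing about the real data.

```r
set.seed(1)

# Stand-in response rebuilt from the class counts in this post
Diag <- factor(rep(1:6, times = c(36, 12, 120, 36, 30, 30)))

# Draw as many cases per class as the smallest class contains
class.n <- table(Diag)
balanced <- rep(min(class.n), length(class.n))

# Fit only if the package is installed; x is hypothetical noise
if (requireNamespace("randomForest", quietly = TRUE)) {
  x <- data.frame(a = rnorm(length(Diag)), b = rnorm(length(Diag)))
  fit <- randomForest::randomForest(x, Diag, strata = Diag,
                                    sampsize = balanced)
  print(fit$confusion)
}
```

Whether this helps depends on the data; the per-class error rates in the confusion matrix are usually more informative here than the overall error rate.
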

Thanks,

weiwei

On 6/30/05, Liaw, Andy <andy_liaw@merck.com> wrote:
> The limitation comes from the way categorical splits are represented in the
> code: For a categorical variable with k categories, the split is
> represented by k binary digits: 0=right, 1=left. So it takes k bits to
> store each split on k categories. To save storage, this is `packed' into a
> 4-byte integer (32-bit), thus the limit of 32 categories.
>
> The current Fortran code (version 5.x) by Breiman and Cutler gets around
> this limitation by storing the split in an integer array. While this lifts
> the 32-category limit, it takes much more memory to store the splits. I'm
> still trying to figure out a more memory efficient way of storing the splits
> without imposing the 32-category limit. If anyone has suggestions, I'm all
> ears.
>
> Best,
> Andy
>
> > From: Arne.Muller@sanofi-aventis.com
> >
> > Hello,
> >
> > I'm using the random forest package. One of my factors in the
> > data set contains 41 levels (I can't code this as a numeric
> > value - in terms of linear models this would be a random
> > factor). The randomForest call comes back with an error
> > telling me that the limit is 32 categories.
> >
> > Is there any reason for this particular limit? Maybe it's
> > possible to recompile the module with a different cutoff?
> >
> > thanks a lot for your help,
> > kind regards,
> >
> >
> > Arne
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
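
The packing scheme described in the quoted reply can be illustrated with a small sketch. This is not the package's code: pack_split and unpack_split are hypothetical helpers, the bit order is an arbitrary choice, and the value is held in an R double because R's own integers are signed and cannot hold all 32 bits as a non-negative number. Only the idea comes from the reply: one bit per category, 1 = left and 0 = right, with at most 32 bits available.

```r
# Pack a logical vector (TRUE = category goes left) into one number.
# Category j contributes 2^(j - 1) when it goes left.
pack_split <- function(goes.left) {
  k <- length(goes.left)
  stopifnot(k <= 32)  # only 32 bits fit in a 4-byte integer
  sum(goes.left * 2^(seq_len(k) - 1))
}

# Recover the left/right assignment of k categories from the packed value
unpack_split <- function(code, k) {
  ((code %/% 2^(seq_len(k) - 1)) %% 2) == 1
}

# Four categories: the 1st, 3rd and 4th go left, the 2nd goes right
code <- pack_split(c(TRUE, FALSE, TRUE, TRUE))
restored <- unpack_split(code, 4)

# A factor with 41 levels does not fit and is rejected
too.many <- try(pack_split(rep(TRUE, 41)), silent = TRUE)
```

Storing one array element per category instead, as the reply says the Fortran version does, removes the limit but costs one stored value per category for every split where the packed form needs a single 4-byte value.
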

-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

Received on Fri Jul 08 05:45:14 2005
