Re: [R] randomForest

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Fri 08 Jul 2005 - 05:44:38 EST

On 7/7/2005 3:38 PM, Weiwei Shi wrote:
> Hi there:
> I have a question on random forest:
>
> recently I helped a friend with her random forest and I ran into this problem:
> her dataset has 6 classes and the sample size is pretty small (264).
> The class distribution is like this (Diag is the response variable):
> sample.size <- lapply(1:6, function(i) sum(Diag==i))
> > sample.size
> [[1]]
> [1] 36
>
> [[2]]
> [1] 12
>
> [[3]]
> [1] 120
>
> [[4]]
> [1] 36
>
> [[5]]
> [1] 30
>
> [[6]]
> [1] 30
>
> I assigned this sample.size to sampsz for stratified sampling
> and I got the following error:
> Error in sum(..., na.rm = na.rm) : invalid 'mode' of argument
>
> if I use sampsz=c(36, 12, 120, 36, 30, 30), then it is fine. Could you
> tell me why?

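The sum() function knows what to do with a vector, but not with a list. lapply() always returns a list, which is why the typed-in c(...) version works and sample.size does not. You can turn your sample.size variable into a vector using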

unlist(sample.size)
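
To make the difference concrete, here is a small sketch. The Diag vector below is made up (it just reproduces the class counts quoted above), and the names size.list, size.vec, size.vec2 and size.tab are only for illustration.

```r
# Hypothetical response vector with the class counts from the question
Diag <- rep(1:6, times = c(36, 12, 120, 36, 30, 30))

# lapply() returns a list; sum() refuses to add up a list
size.list <- lapply(1:6, function(i) sum(Diag == i))
res <- try(sum(size.list), silent = TRUE)
inherits(res, "try-error")   # TRUE: sum() failed on the list

# unlist() flattens the list into an ordinary integer vector
size.vec <- unlist(size.list)
sum(size.vec)                # 264, the total sample size

# sapply() or table() give the per-class counts without the list detour
size.vec2 <- sapply(1:6, function(i) sum(Diag == i))
size.tab  <- table(Diag)
```

Any of size.vec, size.vec2 or as.vector(size.tab) should then be usable wherever c(36, 12, 120, 36, 30, 30) worked. I am assuming here that "sampsz" refers to the sampsize argument of randomForest().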

Duncan Murdoch

> btw, for a classification problem with uneven class sizes like
> this one, do you have any suggestions to improve its accuracy? I
> tried the c() way to make sampsz work, but the result is
> similar.
>
> Thanks,
>
> weiwei
>
> On 6/30/05, Liaw, Andy <andy_liaw@merck.com> wrote:

>> The limitation comes from the way categorical splits are represented in the
>> code:  For a categorical variable with k categories, the split is
>> represented by k binary digits: 0=right, 1=left.  So it takes k bits to
>> store each split on k categories.  To save storage, this is `packed' into a
>> 4-byte integer (32-bit), thus the limit of 32 categories.
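
The packing described above can be sketched in a few lines of R. This is only an illustration of the idea, not the package's actual C/Fortran code: pack.split and sent.left are made-up names, and a double stands in for the 4-byte integer, with the 32-digit limit enforced by hand.

```r
# Pack a left/right assignment for k categories into one number.
# goes.left[j] is 1 if category j is sent left, 0 if it is sent right.
pack.split <- function(goes.left) {
  k <- length(goes.left)
  stopifnot(k <= 32)               # only 32 binary digits fit in 4 bytes
  sum(goes.left * 2^(0:(k - 1)))   # digit j carries weight 2^(j - 1)
}

# Read back the digit for category j from a packed split
sent.left <- function(code, j) (code %/% 2^(j - 1)) %% 2 == 1

code <- pack.split(c(1, 0, 1, 1, 0))   # categories 1, 3 and 4 go left
code                                   # 13 = 1 + 4 + 8
sent.left(code, 3)                     # TRUE
sent.left(code, 2)                     # FALSE
```

A factor with 41 levels would need 41 digits, so pack.split(rep(1, 41)) stops with an error in this sketch, which mirrors the 32-category limit.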
>> 
>> The current Fortran code (version 5.x) by Breiman and Cutler gets around
>> this limitation by storing the split in an integer array.  While this lifts
>> the 32-category limit, it takes much more memory to store the splits.  I'm
>> still trying to figure out a more memory efficient way of storing the splits
>> without imposing the 32-category limit.  If anyone has suggestions, I'm all
>> ears.
>> 
>> Best,
>> Andy
>> 
>> > From: Arne.Muller@sanofi-aventis.com
>> >
>> > Hello,
>> >
>> > I'm using the random forest package. One of my factors in the
>> > data set contains 41 levels (I can't code this as a numeric
>> > value - in terms of linear models this would be a random
>> > factor). The randomForest call comes back with an error
>> > telling me that the limit is 32 categories.
>> >
>> > Is there any reason for this particular limit? Maybe it's
>> > possible to recompile the module with a different cutoff?
>> >
>> >       thanks a  lot for your help,
>> >       kind regards,
>> >
>> >
>> >       Arne
>> >
>> > ______________________________________________
>> > R-help@stat.math.ethz.ch mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide!
>> > http://www.R-project.org/posting-guide.html
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Fri Jul 08 06:04:30 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:21 EST