Re: [R] imbalanced classes

From: Liaw, Andy
Date: Thu 26 Jan 2006 - 13:00:51 EST


I guess the message is meant for me (yet you sent it to R-help).

If you have 10 cases of class A and 100 of class B, not setting sampsize means each tree is grown on a random sample (with replacement) of 110 from the whole lot, which, of course, gives you on average 10 times more Bs than As in the sample. A tree grown on such a sample is not going to do so well in predicting the As. However, if you set sampsize=c(10, 10), then each tree is grown on 10 randomly sampled As and 10 randomly sampled Bs, giving the tree a much better chance of producing roughly similar error rates for As and Bs. If setting the sampsize values equal doesn't quite do it, you can push further in the same direction, i.e., draw even fewer Bs than As.
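To make the difference concrete, here is a rough base-R sketch of the two sampling schemes. It does not call randomForest itself: the toy labels, the seed, and the helper draw_stratified() are made up for illustration, and sampling with replacement inside each class is assumed.

```r
# Hypothetical toy labels: 10 cases of class A and 100 of class B.
y <- factor(rep(c("A", "B"), times = c(10, 100)))

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# No sampsize: each tree sees a bootstrap sample of all 110 cases.
# Count how many As land in each of 2000 such samples.
n_A <- replicate(2000, sum(sample(y, length(y), replace = TRUE) == "A"))
mean_A <- mean(n_A)  # close to 10, so Bs outnumber As about 10 to 1

# Stratified draw, mimicking sampsize = c(10, 10): sample a fixed number
# of cases (with replacement, an assumption) from each class separately.
draw_stratified <- function(y, sizes) {
  idx <- unlist(lapply(seq_along(levels(y)), function(k) {
    pool <- which(y == levels(y)[k])
    pool[sample.int(length(pool), sizes[k], replace = TRUE)]
  }))
  table(y[idx])
}

balanced_counts <- draw_stratified(y, c(10, 10))  # 10 As and 10 Bs
```

With the package itself, the corresponding call would be something like randomForest(x, y, sampsize = c(10, 10)), where x and y stand for your own predictors and class labels.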

As to cutoff, in a two-class problem, it's the same as setting the classification threshold to something other than 0.5. The winning class is the one with the largest ratio of vote fraction to cutoff. E.g., if cutoff=c(0.9, 0.1), then a case with 80% of the votes for class A would still be classified as B, because .8/.9 < .2/.1. Hope that's clear.
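The cutoff arithmetic can be checked with a few lines of base R. The helper classify_with_cutoff() is hypothetical (it is not a randomForest function); it only reproduces the ratio rule described above.

```r
# Hypothetical helper, not part of randomForest: return the class whose
# ratio of vote fraction to cutoff is largest.
classify_with_cutoff <- function(votes, cutoff) {
  ratio <- votes / cutoff
  names(votes)[which.max(ratio)]
}

# A case with 80% of the votes for class A and 20% for class B.
votes <- c(A = 0.8, B = 0.2)

default_class <- classify_with_cutoff(votes, cutoff = c(0.5, 0.5))  # "A"
shifted_class <- classify_with_cutoff(votes, cutoff = c(0.9, 0.1))  # "B"
```

When the two cutoffs sum to 1, this amounts to calling a case A only if its vote fraction for A exceeds the first cutoff, so cutoff=c(0.9, 0.1) demands more than 90% of the votes before predicting A.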

I do have to wonder, though, if you only have a total of 37 cases in the data, how can you be sure the estimates of class error rates you get will pan out on a larger test set? I would think the variability on the estimate of the class error rates is so high that it doesn't make too much sense to try to balance them too much... Just my $0.02.
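A back-of-the-envelope way to see the variability is the binomial standard error of an error rate estimated from n cases. This is only a rough approximation (the out-of-bag estimate is not exactly binomial), and the true error rate of 0.3 below is an arbitrary assumption chosen for illustration.

```r
# Class sizes from the question: 28 cases in class 1, 9 in class 2.
n <- c(class1 = 28, class2 = 9)

# Assumed true per-class error rate; 0.3 is arbitrary, not from the data.
p <- 0.3

# Binomial standard error of an error rate estimated from n cases.
se <- sqrt(p * (1 - p) / n)

# Rough 95% interval half-width under a normal approximation.
half_width <- 1.96 * se
```

Under these assumptions the standard error for the 9-case class is about 0.15, so its estimated error rate could easily move by 0.3 in either direction on another sample of the same size.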

I do plan on implementing the weighted RF (see the To Do part of rfNews()), but don't hold your breath...


From: Mark D'Ascenzo
> Hi Andy,
> I know this topic has been discussed before on the R-help, but I was
> wondering if you could offer some advice specific to my application.
> I'm using the R random forest package to compare two classes of data;
> the number of cases in each class is relatively low, 28 in class 1 and 9
> in class 2. I'd really like to use the R environment to analyze this data,
> however I'm finding it difficult to put much trust in the results of
> my analysis. As you've stated, the classwt variables do not do much,
> and I've tried working with the cutoff and sampsize variables as
> well, with limited success in balancing error rates between the two
> classes.
> It was unclear to me how to use the cutoff parameter correctly. If
> you have any recommendations here, it would be appreciated.
> Additionally with the sampsize variable, I have tried a few values,
> for example setting sampsize = c(2, 6) and c(9, 3), etc. It wasn't
> clear to me if I should be sampling more from the larger class or the
> other way around.
> Lastly, I'm wondering if you are currently working or have plans to
> release in the near future an R version of randomForest that is
> equivalent to the FORTRAN rf5 package. It works wonderfully for my
> application, but getting data in and out of it, changing parameters,
> compiling is just a pain, as I'm sure you agree.
> Your thoughts would be greatly appreciated.
> Kind regards,
> Mark D'Ascenzo
> Biomedical Engineering
> Cornell University
> Ithaca, NY 14853
> ______________________________________________
> mailing list
> PLEASE do read the posting guide!
Received on Thu Jan 26 13:11:14 2006

This archive was generated by hypermail 2.1.8 : Thu 26 Jan 2006 - 20:06:08 EST