Re: [R] randomForest

From: Liaw, Andy <andy_liaw_at_merck.com>
Date: Fri 08 Jul 2005 - 07:07:13 EST


With small sample sizes, the variability of the test set error estimate will be large. Instead of splitting the data once, you should consider cross-validation or the bootstrap for estimating performance.
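
As a rough illustration of the cross-validation suggestion, here is a minimal base-R sketch of stratified k-fold cross-validation. The helper names (stratified_folds, cv_error, fit_fun, pred_fun) are hypothetical and not part of randomForest. The majority-class "model" and the iris data are only stand-ins so the example runs without extra packages.

```r
# Assign each observation to one of k folds, stratified by class label so
# that small classes are spread across the folds as evenly as possible.
stratified_folds <- function(y, k = 5, seed = 1) {
  set.seed(seed)
  folds <- integer(length(y))
  offset <- 0
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # Shuffle the members of this class (sample.int avoids the
    # sample(<single number>) pitfall when a class has one member).
    idx <- idx[sample.int(length(idx))]
    # Deal fold labels round-robin, continuing where the last class stopped.
    folds[idx] <- (offset + seq_along(idx) - 1) %% k + 1
    offset <- offset + length(idx)
  }
  folds
}

# Estimate the misclassification rate by k-fold cross-validation.
# fit_fun(x, y) returns a model; pred_fun(model, newx) returns class labels.
cv_error <- function(x, y, fit_fun, pred_fun, k = 5, seed = 1) {
  folds <- stratified_folds(y, k, seed)
  wrong <- 0
  for (f in seq_len(k)) {
    test <- folds == f
    model <- fit_fun(x[!test, , drop = FALSE], y[!test])
    pred <- pred_fun(model, x[test, , drop = FALSE])
    wrong <- wrong + sum(as.character(pred) != as.character(y[test]))
  }
  wrong / length(y)
}

# Placeholder classifier: always predict the most frequent training class.
majority_fit <- function(x, y) names(which.max(table(y)))
majority_pred <- function(model, newx) rep(model, nrow(newx))

err <- cv_error(iris[, 1:4], iris$Species, majority_fit, majority_pred, k = 5)
print(err)

# With randomForest installed, the same scaffold could be used like this
# (not run here; x and y stand for your predictors and response):
# library(randomForest)
# cv_error(x, y,
#          fit_fun  = function(x, y) randomForest(x, y),
#          pred_fun = function(m, newx) predict(m, newx))
```

Every observation is used for testing exactly once, so the estimate does not depend on a single lucky or unlucky split.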

AFAIK gbm as is won't handle more than two classes. You will need to do quite a bit of work to get it to do what MART does.

Andy

> From: Weiwei Shi
>
> Thanks. But can you suggest some approaches for classification problems
> where some classes have too few observations?
>
> The following is the result after adding sampsize:
> > najie.rf.2 <- randomForest(Diag ~ ., data = one.df[ind == 1, 4:ncol(one.df)],
> +                            importance = TRUE, sampsize = unlist(sample.size))
> > najie.pred.2 <- predict(najie.rf.2, one.df[ind == 2, ])
>
> > table(observed=one.df[ind==2,"Diag"], predicted=najie.pred.2)
>         predicted
> observed  1  2  3  4  5  6
>        1  6  0  1  0  0  1
>        2  0  4  0  0  0  0
>        3  1  0 37  0  0  0
>        4  0  0  3  5  0  0
>        5  1  0  3  0  8  0
>        6  0  0  0  3  0  5
>
> The per-class counts in sample.size are:
> 28, 8, 82, 28, 18, 22
>
> Should I try gbm, since it might "focus" more on
> misclassified cases?
>
> thanks,
>
> weiwei
>
>
> On 7/7/05, Liaw, Andy <andy_liaw@merck.com> wrote:
> > > From: Weiwei Shi
> > >
> > > it works.
> > > thanks,
> > >
> > > but: (just curious)
> > > why i tried previously and i got
> > >
> > > > is.vector(sample.size)
> > > [1] TRUE
> >
> > Because a list is also a vector:
> >
> > > a <- c(list(1), list(2))
> > > a
> > [[1]]
> > [1] 1
> >
> > [[2]]
> > [1] 2
> >
> > > is.vector(a)
> > [1] TRUE
> > > is.numeric(a)
> > [1] FALSE
> >
> > Actually, the way I initialize a list of known length is by something like:
> >
> > myList <- vector(mode="list", length=veryLong)
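
To make the list-versus-vector point concrete, the following sketch shows which base-R predicates tell the two apart, and how a preallocated list can be filled. The variable names (a, b, n, myList) are illustrative only.

```r
a <- list(1, 2)

is.vector(a)    # TRUE: a list is a (generic) vector
is.list(a)      # TRUE
is.atomic(a)    # FALSE: this is the check that separates lists from atomic vectors
is.numeric(a)   # FALSE

b <- unlist(a)  # flatten to an atomic numeric vector
is.atomic(b)    # TRUE
is.numeric(b)   # TRUE

# Preallocate a list of known length, then fill it element by element.
n <- 6
myList <- vector(mode = "list", length = n)
for (i in seq_len(n)) {
  myList[[i]] <- i^2
}
length(myList)  # 6
```

So when an argument must be a numeric vector, is.numeric() or is.atomic() is a more telling check than is.vector().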
> >
> > Andy
> >
> >
> > > I also tried as.vector(sample.size) and assigned it to sampsz; it still
> > > does not work.
> > >
> > > On 7/7/05, Duncan Murdoch <murdoch@stats.uwo.ca> wrote:
> > > > On 7/7/2005 3:38 PM, Weiwei Shi wrote:
> > > > > Hi there:
> > > > > I have a question on random forest:
> > > > >
> > > > > Recently I helped a friend with her random forest and I ran into
> > > > > this problem: her dataset has 6 classes and the sample size is
> > > > > pretty small (264). The class distribution is like this (Diag is
> > > > > the response variable):
> > > > > sample.size <- lapply(1:6, function(i) sum(Diag==i))
> > > > >> sample.size
> > > > > [[1]]
> > > > > [1] 36
> > > > >
> > > > > [[2]]
> > > > > [1] 12
> > > > >
> > > > > [[3]]
> > > > > [1] 120
> > > > >
> > > > > [[4]]
> > > > > [1] 36
> > > > >
> > > > > [[5]]
> > > > > [1] 30
> > > > >
> > > > > [[6]]
> > > > > [1] 30
> > > > >
> > > > > I assigned this sample.size to sampsz for stratified sampling
> > > > > and got the following error:
> > > > > Error in sum(..., na.rm = na.rm) : invalid 'mode' of argument
> > > > >
> > > > > If I use sampsz=c(36, 12, 120, 36, 30, 30), then it is fine. Could
> > > > > you tell me why?
> > > >
> > > > The sum() function knows what to do on a vector, but not on a list.
> > > > You can turn your sample.size variable into a vector using
> > > >
> > > > unlist(sample.size)
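
For illustration, here is a small sketch of the failure and a few ways to get the class counts as a plain vector. The Diag factor below is synthetic, rebuilt from the class counts quoted in this thread; it is not the poster's data.

```r
# Synthetic response with the class counts reported in the thread.
Diag <- factor(rep(1:6, times = c(36, 12, 120, 36, 30, 30)))

# lapply() always returns a list, even when each element is a single number.
sample.size <- lapply(1:6, function(i) sum(Diag == i))

# sum() on a list fails; capture the error instead of stopping.
res <- tryCatch(sum(sample.size), error = function(e) "error")
print(res)

# Option 1: flatten the list afterwards.
counts1 <- unlist(sample.size)

# Option 2: build an integer vector directly with vapply().
counts2 <- vapply(1:6, function(i) sum(Diag == i), integer(1))

# Option 3: let table() do the counting.
counts3 <- as.vector(table(Diag))

print(counts1)
sum(counts1)
```

All three options yield the same integer vector, which can then be passed wherever a numeric vector of per-class sizes is expected.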
> > > >
> > > > Duncan Murdoch
> > > >
> > > > > By the way, for a classification problem like this with uneven
> > > > > class sizes, do you have any suggestions for improving accuracy? I
> > > > > tried the c() approach to make sampsz work, but the result is
> > > > > similar.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > weiwei
> > > > >
> > > > > On 6/30/05, Liaw, Andy <andy_liaw@merck.com> wrote:
> > > > >> The limitation comes from the way categorical splits are represented
> > > > >> in the code: For a categorical variable with k categories, the split
> > > > >> is represented by k binary digits: 0=right, 1=left. So it takes k bits
> > > > >> to store each split on k categories. To save storage, this is `packed'
> > > > >> into a 4-byte integer (32-bit), thus the limit of 32 categories.
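
The packing idea described above can be sketched in a few lines of R. This is only an illustration of the bit-per-category encoding, not the package's actual C or Fortran code. The function names pack_split and unpack_split are hypothetical, and the code is held in a double rather than a true 32-bit integer to keep the arithmetic simple.

```r
# Encode a categorical split as one number: bit j is 1 if category j goes
# left and 0 if it goes right. With one bit per category and 32 bits
# available, at most 32 categories fit.
pack_split <- function(goes_left) {
  k <- length(goes_left)
  stopifnot(k <= 32)
  sum(as.numeric(goes_left) * 2^(seq_len(k) - 1))
}

# Recover the left/right assignment of each of the k categories.
unpack_split <- function(code, k) {
  (code %/% 2^(seq_len(k) - 1)) %% 2 == 1
}

split <- c(TRUE, FALSE, TRUE, TRUE, FALSE)  # categories 1, 3, 4 go left
code <- pack_split(split)                   # 1 + 4 + 8 = 13
print(code)
print(unpack_split(code, length(split)))

# A factor with 41 levels cannot be encoded this way.
too_many <- tryCatch(pack_split(rep(TRUE, 41)), error = function(e) "error")
print(too_many)
```

Storing one array element per category instead removes the limit, at the cost of the extra memory mentioned in the next paragraph.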
> > > > >>
> > > > >> The current Fortran code (version 5.x) by Breiman and Cutler gets
> > > > >> around this limitation by storing the split in an integer array.
> > > > >> While this lifts the 32-category limit, it takes much more memory to
> > > > >> store the splits. I'm still trying to figure out a more memory
> > > > >> efficient way of storing the splits without imposing the 32-category
> > > > >> limit. If anyone has suggestions, I'm all ears.
> > > > >>
> > > > >> Best,
> > > > >> Andy
> > > > >>
> > > > >> > From: Arne.Muller@sanofi-aventis.com
> > > > >> >
> > > > >> > Hello,
> > > > >> >
> > > > >> > I'm using the random forest package. One of my factors in the
> > > > >> > data set contains 41 levels (I can't code this as a numeric
> > > > >> > value - in terms of linear models this would be a random
> > > > >> > factor). The randomForest call comes back with an error
> > > > >> > telling me that the limit is 32 categories.
> > > > >> >
> > > > >> > Is there any reason for this particular limit? Maybe it's
> > > > >> > possible to recompile the module with a different cutoff?
> > > > >> >
> > > > >> > thanks a lot for your help,
> > > > >> > kind regards,
> > > > >> >
> > > > >> >
> > > > >> > Arne
> > > > >> >
> > > > >> > ______________________________________________
> > > > >> > R-help@stat.math.ethz.ch mailing list
> > > > >> > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > >> > PLEASE do read the posting guide!
> > > > >> > http://www.R-project.org/posting-guide.html
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Weiwei Shi, Ph.D
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
> --
> Weiwei Shi, Ph.D
>
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
>
>
>



Received on Fri Jul 08 07:16:56 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:22 EST