Re: [R] Running randomForests on large datasets

From: Nagu <thogiti_at_gmail.com>
Date: Wed, 27 Feb 2008 09:31:34 -0800

Thank you, Andy.

It is throwing a memory allocation error for me for numerous combinations of ntree and nodesize values. I tried memory.limit() and memory.size() to make the maximum memory available, but the error persisted. One thing I noticed, though, is that I previously had a hard time even just loading the dataset. I then used the Rcmdr library to load the same data: it was faster than loading from the R console, and it did not throw the allocation errors that the console did now and then. Thinking this might be a fluke, I opened Rcmdr a few more times, and every time it loaded the large dataset without any allocation errors. I also repeated the process with a few other programs open on the desktop, and it still loaded just fine.
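
For reference, the memory checks I tried looked roughly like this (a minimal sketch; these functions are Windows-specific, and the 4000 MB figure is only an illustration, not what I actually set):

    memory.size()               # MB of memory currently in use
    memory.size(max = TRUE)     # maximum MB obtained from the OS so far
    memory.limit()              # current memory limit, in MB
    memory.limit(size = 4000)   # request a higher limit, if the OS allows it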

Any ideas on how Rcmdr loads the file, as opposed to the R console (where I am using read.table())?
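
For what it's worth, here is a sketch of the kind of hints I could pass to read.table() to cut down the memory it spends guessing column types and growing the data (the file name, separator, and column types below are placeholders for illustration only):

    ## Placeholder file and column description; adjust to the real dataset.
    col.types <- c("factor", rep("numeric", 649))
    dat <- read.table("mydata.txt", header = TRUE, sep = "\t",
                      colClasses = col.types,   # avoid type guessing
                      nrows = 500000,           # pre-allocate about the right size
                      comment.char = "")        # skip comment scanning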

Anyway, I thought I'd share this observation with others. Thank you, Andy, for your ideas; I'll keep tinkering with the parameters.
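
Concretely, per your suggestions below, I will start with something like this (a minimal sketch; the formula and data object are placeholders):

    library(randomForest)
    ## Placeholder formula/data; ntree and nodesize per Andy's suggestions below.
    fit <- randomForest(y ~ ., data = mydata,
                        ntree = 10,      # far fewer trees than the default 500
                        nodesize = 21)   # larger terminal nodes => smaller trees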

Thank you,
Nagu

On Wed, Feb 27, 2008 at 5:24 AM, Liaw, Andy <andy_liaw_at_merck.com> wrote:
> There are a couple of things you may want to try, if you can load the
> data into R and still have enough to spare:
>
> - Run randomForest() with fewer trees, say 10 to start with.
>
> - Run randomForest() with nodesize set to something larger than the
> default (1 for classification, 5 for regression). This puts a limit
> on the size of the trees being grown. Try something like 21 and see
> if that runs, and adjust accordingly.
>
> HTH,
> Andy
>
>
> From: Nagu
>
>
>
> > Hi,
> >
> > I am trying to run randomForest on a dataset of size 500000 x 650 and
> > R pops up a memory allocation error. Are there better ways to deal
> > with large datasets in R? For example, S-PLUS had something like the
> > bigData library.
> >
> > Thank you,
> > Nagu
> >
>



Received on Wed 27 Feb 2008 - 17:34:05 GMT
