Re: [R] Cforest and Random Forest memory use

From: Max Kuhn <mxkuhn_at_gmail.com>
Date: Mon, 14 Jun 2010 10:19:11 -0400

The first thing that I would recommend is to avoid the "formula interface" to models. The internals that R uses to create matrices from a formula + data set are not efficient. If you had a large number of variables, I would have automatically pointed to that as a source of issues. cforest and ctree only have formula interfaces though, so you are stuck on that one. The randomForest package has both interfaces, so that might be better.
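As a rough illustration of the two interfaces in randomForest, here is a minimal sketch on simulated data. The data, the object names (dat, x, y, fit_formula, fit_xy) and the small ntree value are made up for the example; only the response name badflag comes from the thread.

```r
library(randomForest)

set.seed(1)
# Simulated stand-in for the real data: 2000 rows, 5 numeric predictors.
n <- 2000
dat <- data.frame(matrix(rnorm(n * 5), ncol = 5))  # columns X1..X5
dat$badflag <- factor(ifelse(dat$X1 + rnorm(n) > 0, "bad", "good"))

# Formula interface: R first builds a model frame from the formula and data.
fit_formula <- randomForest(badflag ~ ., data = dat, ntree = 50)

# x/y interface: hand over predictors and response directly,
# so the formula machinery is not involved.
x <- dat[, setdiff(names(dat), "badflag")]
y <- dat$badflag
fit_xy <- randomForest(x = x, y = y, ntree = 50)
```

On a data set this small the difference will not be noticeable; the point is only the shape of the call.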

Probably the issue is the depth of the trees. With that many observations, you are likely to get extremely deep trees. You might try limiting the depth of the tree and see if that has an effect on performance.
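One way to try this, sketched with randomForest rather than cforest: the nodesize and maxnodes arguments restrict how far each tree can grow. The simulated data and the particular values (50 and 64) are arbitrary choices for illustration, not recommendations.

```r
library(randomForest)

set.seed(1)
# Simulated stand-in for the real data.
n <- 5000
x <- data.frame(matrix(rnorm(n * 5), ncol = 5))  # columns X1..X5
y <- factor(ifelse(x$X1 + rnorm(n) > 0, "bad", "good"))

# Defaults: for classification the minimum terminal node size is 1,
# so trees can grow very large on noisy data.
fit_deep <- randomForest(x = x, y = y, ntree = 50)

# Constrained: larger terminal nodes and at most 64 terminal nodes per tree.
fit_small <- randomForest(x = x, y = y, ntree = 50,
                          nodesize = 50, maxnodes = 64)

# Compare the number of terminal nodes per tree and the object sizes.
summary(treesize(fit_deep))
summary(treesize(fit_small))
print(object.size(fit_deep), units = "Mb")
print(object.size(fit_small), units = "Mb")
```

Whether the smaller trees cost anything in accuracy would need to be checked on held-out data.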

We run into these issues with large compound libraries; in those cases we do whatever we can to avoid ensembles of trees or kernel methods. If you want those, you might need to write your own code that is hyper-efficient and tuned to your particular data structure (as we did).

On another note... are this many observations really needed? You have 40ish variables; I suspect that >1M points are pretty densely packed into 40-dimensional space. Do you lose much by sampling the data set or allocating a large portion to a test set? If you have thousands of predictors, I could see the need for so many observations, but I'm wondering if many of the samples are redundant.
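A minimal sketch of the sampling idea in base R, assuming the data frame is called example as in the code quoted below. The simulated data and the 10% training fraction are arbitrary assumptions for illustration.

```r
set.seed(1)
# Simulated stand-in for the full data set.
n <- 100000
example <- data.frame(matrix(rnorm(n * 5), ncol = 5))  # columns X1..X5
example$badflag <- factor(ifelse(example$X1 + rnorm(n) > 1.5, "bad", "good"))

# Simple random sample: 10% of rows for training, the rest held out for testing.
keep <- sample.int(n, size = round(0.10 * n))
train <- example[keep, ]
test <- example[-keep, ]

# Check that the class balance is similar in both pieces.
prop.table(table(train$badflag))
prop.table(table(test$badflag))
```

Fitting on train at a few different fractions and scoring on test would show whether the extra observations actually buy anything.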

Max

On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane <mlokane_at_gmail.com> wrote:
> Answers added below.
> Thanks again,
> Matt
>
> On 11 June 2010 14:28, Max Kuhn <mxkuhn_at_gmail.com> wrote:
>>
>> Also, you have not said:
>>
>>  - your OS: Windows Server 2003 64-bit
>>  - your version of R: 2.11.1 64-bit
>>  - your version of party: 0.9-9995
>
>
>>
>>  - your code:
>
> test.cf <- cforest(formula = badflag ~ ., data = example,
>     control = cforest_control(teststat = 'max', testtype = 'Teststatistic',
>         replace = FALSE, ntree = 500, savesplitstats = FALSE, mtry = 10))
>
>>  - what "Large data set" means: > 1 million observations, 40+ variables,
>> around 200MB
>>  - what "very large model objects" means - anything which breaks
>>
>> So... how is anyone supposed to help you?
>>
>> Max
>
>

-- 

Max

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon 14 Jun 2010 - 14:21:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 17 Jun 2010 - 23:50:32 GMT.
