Re: [R] Cforest and Random Forest memory use

From: Max Kuhn <mxkuhn_at_gmail.com>
Date: Fri, 18 Jun 2010 13:35:23 -0400

Rich's calculations are correct, but from a practical standpoint I think that using all the data for the model is overkill for a few reasons:

Again, these are not refutations of your calculations. I just think that there are plenty of non-theoretical arguments for not using all of those values for the training set.

Thanks,

Max
On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter <gunter.berton_at_gene.com> wrote:
> Rich is right, of course. One way to think about it is this (paraphrased from
> the section on the "Curse of Dimensionality" in Hastie et al.'s "The Elements
> of Statistical Learning"): suppose 10 uniformly distributed points on a line
> give what you consider to be adequate coverage of the line. Then in 40
> dimensions, you'd need 10^40 uniformly distributed points to give equivalent
> coverage.
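
A quick back-of-the-envelope version of that scaling argument in R (an illustration only, not code from the thread):

    # To keep the same per-axis density of 10 points, the number of points
    # needed grows as 10^d with the number of dimensions d.
    d <- c(1, 2, 10, 40)
    data.frame(dimensions = d, points_needed = 10^d)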
>
> Various other aspects of the curse of dimensionality are discussed in the
> book, one of which is that in high dimensions, most points are closer to the
> boundaries than to each other. As Rich indicates, this has profound
> implications for what one can sensibly do with such data. One example is that
> nearest neighbor procedures don't make much sense (as nobody is likely to
> have anybody else nearby), which Rich's little simulation nicely
> demonstrated.
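
A small R illustration of that boundary effect (the 0.05 "shell" width is an arbitrary choice for illustration):

    # Fraction of a unit hypercube lying within 0.05 of its boundary.
    # The interior is a cube of side 0.9, so the boundary shell has
    # volume 1 - 0.9^d, which approaches 1 as d grows.
    d <- c(1, 2, 10, 40)
    data.frame(dimensions = d, fraction_near_boundary = 1 - 0.9^d)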
>
> Cheers to all,
>
> Bert Gunter
> Genentech Nonclinical Statistics
>
>
>
> -----Original Message-----
> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On
> Behalf Of Raubertas, Richard
> Sent: Thursday, June 17, 2010 4:15 PM
> To: Max Kuhn; Matthew OKane
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
>
>
>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org
>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Max Kuhn
>> Sent: Monday, June 14, 2010 10:19 AM
>> To: Matthew OKane
>> Cc: r-help_at_r-project.org
>> Subject: Re: [R] Cforest and Random Forest memory use
>>
>> The first thing that I would recommend is to avoid the "formula
>> interface" to models. The internals that R uses to create matrices
>> from a formula + data set are not efficient. If you had a large number
>> of variables, I would have automatically pointed to that as a source
>> of issues. cforest and ctree only have formula interfaces though, so
>> you are stuck on that one. The randomForest package has both
>> interfaces, so that might be better.
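
For example, randomForest's matrix/data-frame interface skips the formula machinery entirely. A minimal sketch, reusing the `example` data frame and `badflag` response named in the original post quoted further down:

    library(randomForest)
    ## Formula interface: builds a full model frame internally.
    ## fit <- randomForest(badflag ~ ., data = example, ntree = 500)
    ## Matrix/data-frame interface: usually lighter on memory.
    x <- example[, setdiff(names(example), "badflag")]
    y <- example$badflag   # for classification, badflag should be a factor
    fit <- randomForest(x = x, y = y, ntree = 500)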
>>
>> Probably the issue is the depth of the trees. With that many
>> observations, you are likely to get extremely deep trees. You might
>> try limiting the depth of the tree and see if that has an effect on
>> performance.
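
With randomForest, one way to keep the trees shallow is to raise the minimum terminal node size and cap the number of nodes. A sketch only (the values are arbitrary; `x` and `y` as in the sketch above):

    library(randomForest)
    fit_small <- randomForest(x = x, y = y, ntree = 500,
                              nodesize = 100,  # stop splitting nodes with < 100 obs
                              maxnodes = 64)   # at most 64 terminal nodes per tree
    ## Shallower trees mean a much smaller fitted object.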
>>
>> We run into these issues with large compound libraries; in those cases
>> we do whatever we can to avoid ensembles of trees or kernel methods.
>> If you want those, you might need to write your own code that is
>> hyper-efficient and tuned to your particular data structure (as we
>> did).
>>
>> On another note... are this many observations really needed? You have
>> 40ish variables; I suspect that >1M points are pretty densely packed
>> into 40-dimensional space.
>
> This did not seem right to me:  40-dimensional space is very, very big
> and even a million observations will be thinly spread.  There is probably
> some analytic result from the theory of coverage processes about this,
> but I just did a quick simulation.  If a million samples are independently
> and randomly distributed in a 40-d unit hypercube, then >90% of the points
> in the hypercube will be more than one-quarter of the maximum possible
> distance (sqrt(40)) from the nearest sample.  And about 40% of the hypercube
> will be more than one-third of the maximum possible distance to the nearest
> sample.  So the samples do not densely cover the space at all.
>
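
A scaled-down R sketch of that kind of simulation (not Rich's original code; the sample size is reduced so it runs quickly, so the fractions come out somewhat larger than the numbers he reports):

    set.seed(1)
    d         <- 40
    n_samples <- 1e5    # Rich used 1e6; reduced here for speed
    n_query   <- 200    # random locations at which coverage is checked
    samples_t <- t(matrix(runif(n_samples * d), ncol = d))  # d x n_samples
    query     <- matrix(runif(n_query * d), ncol = d)

    # Distance from each query point to its nearest sample
    nearest <- apply(query, 1, function(q)
      sqrt(min(colSums((samples_t - q)^2))))

    max_dist <- sqrt(d)            # diameter of the unit hypercube
    mean(nearest > max_dist / 4)   # fraction of space far from every sample
    mean(nearest > max_dist / 3)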
> One implication is that modeling the relation of a response to 40 predictors
> will inevitably require a lot of smoothing, even with a million data points.
>
> Richard Raubertas
> Merck & Co.
>
>> Do you lose much by sampling the data set
>> or allocating a large portion to a test set? If you have thousands of
>> predictors, I could see the need for so many observations, but I'm
>> wondering if many of the samples are redundant.
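
A minimal sketch of that kind of down-sampling, again assuming the `example` data frame from the original post (the 200,000-row training size is an arbitrary illustration):

    set.seed(42)
    n         <- nrow(example)
    train_idx <- sample.int(n, size = 2e5)   # keep 200k rows for training
    train     <- example[train_idx, ]
    test      <- example[-train_idx, ]       # the rest becomes a large test set
    ## Fit on `train`, then check whether performance on `test` suffers
    ## compared with a model fit to a larger sample.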
>>
>> Max
>>
>> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
>> <mlokane_at_gmail.com> wrote:
>> > Answers added below.
>> > Thanks again,
>> > Matt
>> >
>> > On 11 June 2010 14:28, Max Kuhn <mxkuhn_at_gmail.com> wrote:
>> >>
>> >> Also, you have not said:
>> >>
>> >>  - your OS: Windows Server 2003 64-bit
>> >>  - your version of R: 2.11.1 64-bit
>> >>  - your version of party: 0.9-9995
>> >
>> >
>> >>
>> >>  - your code:
>> >
>> > test.cf <- cforest(formula = badflag ~ ., data = example,
>> >                    control = cforest_control(teststat = 'max',
>> >                        testtype = 'Teststatistic', replace = FALSE,
>> >                        ntree = 500, savesplitstats = FALSE, mtry = 10))
>> >
>> >>  - what "Large data set" means: > 1 million observations,
>> >>    40+ variables, around 200MB
>> >>  - what "very large model objects" means - anything which breaks
>> >>
>> >> So... how is anyone supposed to help you?
>> >>
>> >> Max



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
