From: Max Kuhn <mxkuhn_at_gmail.com>

Date: Fri, 18 Jun 2010 13:35:23 -0400


Again, these are not refutations of your calculations. I just think that there are plenty of non-theoretical arguments for not using all of those values for the training set.

Thanks,

Max

On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter <gunter.berton_at_gene.com> wrote:

> Rich is right, of course. One way to think about it is this (paraphrased
> from the section on the "Curse of Dimensionality" in Hastie et al.'s
> "Statistical Learning" book): suppose 10 uniformly distributed points on a
> line give what you consider to be adequate coverage of the line. Then in
> 40 dimensions, you'd need 10^40 uniformly distributed points to give
> equivalent coverage.
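
A minimal R sketch of the arithmetic behind that scaling (the numbers are
the ones from Bert's example, not a general rule):

    n <- 10   # points giving adequate coverage of a line (1 dimension)
    d <- 40   # dimensions
    n^d       # points needed for equivalent coverage: 1e+40

The point is that the sample size needed for fixed coverage grows
exponentially in the dimension.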
>
> Various other aspects of the curse of dimensionality are discussed in the
> book, one of which is that in high dimensions, most points are closer to
> the boundaries than to each other. As Rich indicates, this has profound
> implications for what one can sensibly do with such data. One example:
> nearest neighbor procedures don't make much sense (as nobody is likely to
> have anybody else nearby), which Rich's little simulation nicely
> demonstrated.
>
> Cheers to all,
>
> Bert Gunter
> Genentech Nonclinical Statistics
>
>
> -----Original Message-----
> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org]
> On Behalf Of Raubertas, Richard
> Sent: Thursday, June 17, 2010 4:15 PM
> To: Max Kuhn; Matthew OKane
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
>
>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org
>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Max Kuhn
>> Sent: Monday, June 14, 2010 10:19 AM
>> To: Matthew OKane
>> Cc: r-help_at_r-project.org
>> Subject: Re: [R] Cforest and Random Forest memory use
>>
>> The first thing that I would recommend is to avoid the formula
>> interface to models. The internals that R uses to create matrices
>> from a formula + data set are not efficient. If you had a large number
>> of variables, I would have automatically pointed to that as a source
>> of issues. cforest and ctree only have formula interfaces, though, so
>> you are stuck on that one. The randomForest package has both
>> interfaces, so that might be better.
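
A sketch of the two calling conventions, using randomForest (which has
both); 'dat' and 'badflag' are hypothetical names standing in for the
poster's data:

    library(randomForest)

    ## Formula interface: convenient, but R builds a model frame and design
    ## matrix behind the scenes, which copies the data.
    fit1 <- randomForest(badflag ~ ., data = dat)

    ## x/y interface: skips the model-frame machinery, typically lighter on
    ## memory for large data sets.
    x <- dat[, setdiff(names(dat), "badflag")]
    fit2 <- randomForest(x = x, y = dat$badflag)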
>>
>> Probably the issue is the depth of the trees. With that many
>> observations, you are likely to get extremely deep trees. You might
>> try limiting the depth of the tree and see if that has an effect on
>> performance.
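
In randomForest, two arguments cap tree size directly; a small sketch
(same hypothetical 'dat'/'badflag' names as above):

    library(randomForest)

    ## Larger terminal nodes and a hard cap on the number of terminal nodes
    ## both yield shallower trees and smaller model objects.
    fit <- randomForest(x = dat[, setdiff(names(dat), "badflag")],
                        y = dat$badflag,
                        nodesize = 100,   # minimum size of terminal nodes
                        maxnodes = 64)    # maximum terminal nodes per tree

In party, ctree_control() has analogous settings (e.g. maxdepth, minsplit,
minbucket).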
>>
>> We run into these issues with large compound libraries; in those cases
>> we do whatever we can to avoid ensembles of trees or kernel methods.
>> If you want those, you might need to write your own code that is
>> hyper-efficient and tuned to your particular data structure (as we
>> did).
>>
>> On another note... are this many observations really needed? You have
>> 40ish variables; I suspect that >1M points are pretty densely packed
>> into 40-dimensional space.
>
> This did not seem right to me: 40-dimensional space is very, very big,
> and even a million observations will be thinly spread. There is probably
> some analytic result from the theory of coverage processes about this,
> but I just did a quick simulation. If a million samples are independently
> and randomly distributed in a 40-d unit hypercube, then >90% of the points
> in the hypercube will be more than one-quarter of the maximum possible
> distance (sqrt(40)) from the nearest sample. And about 40% of the
> hypercube will be more than one-third of the maximum possible distance
> to the nearest sample. So the samples do not densely cover the space
> at all.
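
Richard's exact code isn't in the thread; the following is a rough
reconstruction of that kind of simulation, scaled down so it runs quickly
in plain R (so the fractions it prints will not match his 1e6-sample
figures):

    set.seed(1)
    d      <- 40
    n_samp <- 1e4    # stand-in for the 1e6 "training" samples
    n_test <- 200    # random probe points in the hypercube
    samp   <- matrix(runif(n_samp * d), ncol = d)
    test   <- matrix(runif(n_test * d), ncol = d)

    ## Distance from each probe point to its nearest sample.
    nearest <- apply(test, 1, function(p)
      sqrt(min(colSums((t(samp) - p)^2))))

    max_dist <- sqrt(d)            # maximum possible distance in the unit cube
    mean(nearest > max_dist / 4)   # fraction beyond 1/4 of the maximum
    mean(nearest > max_dist / 3)   # fraction beyond 1/3 of the maximum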
>
> One implication is that modeling the relation of a response to 40
> predictors will inevitably require a lot of smoothing, even with a
> million data points.
>
> Richard Raubertas
> Merck & Co.
>
>> Do you lose much by sampling the data set
>> or allocating a large portion to a test set? If you have thousands of
>> predictors, I could see the need for so many observations, but I'm
>> wondering if many of the samples are redundant.
>>
>> Max
>>
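
One simple way to act on that suggestion, as a sketch (same hypothetical
'dat' name as above; the 20% fraction is arbitrary):

    set.seed(1)
    n         <- nrow(dat)
    train_idx <- sample(n, size = floor(0.2 * n))
    train     <- dat[train_idx, ]    # fit the forest on a subsample
    test      <- dat[-train_idx, ]   # large held-out set for honest assessment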
>> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
>> <mlokane_at_gmail.com> wrote:
>> > Answers added below.
>> > Thanks again,
>> > Matt
>> >
>> > On 11 June 2010 14:28, Max Kuhn <mxkuhn_at_gmail.com> wrote:
>> >>
>> >> Also, you have not said:
>> >>
>> >> - your OS: Windows Server 2003 64-bit
>> >> - your version of R: 2.11.1 64-bit
>> >> - your version of party: 0.9-9995
>> >>
>> >> - your code:
>> >>     test.cf <- cforest(formula = badflag ~ ., data = example,
>> >>                        control = cforest_control(teststat = 'max',
>> >>                            testtype = 'Teststatistic', replace = FALSE,
>> >>                            ntree = 500, savesplitstats = FALSE,
>> >>                            mtry = 10))
>> >>
>> >> - what "Large data set" means: > 1 million observations,
>> >>   40+ variables, around 200MB
>> >> - what "very large model objects" means: anything which breaks
>> >>
>> >> So... how is anyone supposed to help you?
>> >>
>> >> Max
