Re: [R] Cforest and Random Forest memory use

From: Raubertas, Richard <richard_raubertas_at_merck.com>
Date: Fri, 18 Jun 2010 15:48:32 -0400

Max,
My disagreement was really just about the single statement 'I suspect that >1M points are pretty densely packed into 40-dimensional space' in your original post. On the larger issue of diminishing returns with the size of a training set, I agree with your points below.

Rich

> -----Original Message-----
> From: Max Kuhn [mailto:mxkuhn_at_gmail.com]
> Sent: Friday, June 18, 2010 1:35 PM
> To: Bert Gunter
> Cc: Raubertas, Richard; Matthew OKane; r-help_at_r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> Rich's calculations are correct, but from a practical standpoint I
> think that using all the data for the model is overkill for a few
> reasons:
>
> - the calculations that you show implicitly assume that the predictor
> values can be reliably differentiated from each other. Unless they are
> deterministic calculations (e.g. number of hydrogen bonds, % GC in a
> sequence), they carry measurement error. We don't know anything about the
> context here, but in the lab sciences, the measurement variation can
> make the *effective* number of predictor values much less than n. So
> you can have millions of predictor values but you might only be able
> to differentiate k <<<< n values reliably.
>
> - the important dimensionality to consider is based on how many of
> those 40 are relevant to the outcome. Again, we don't know the context
> of the data but there is a strong prior towards the number of
> important variables being less than 40.
>
> - We've had to consider these types of problems a lot. We might have
> 200K samples (compounds in this case) and 1000 predictors that appear
> to matter. Ensembles of trees tended to do very well, as did kernel
> methods. In either of those two classes of models, the prediction time
> for a single new observation is very long. So we looked at how
> performance was affected if we were to reduce the training set size.
> In essence, we found that <50% of the data could be used with no
> appreciable effect on performance. We could make the percentage
> smaller if we used the predictor values to sample the data set for
> prediction; if we had m samples in the training set, the next sample
> added would have to have maximum dissimilarity to the existing m
> samples.
>
> - If you are going to do any feature selection, you would be better
> off segregating a percentage of those million samples as a hold-out
> set to validate the selection process (a few people from Merck have
> written excellent papers on the selection bias problem). Similarly, if
> this is a classification problem, any ROC curve analysis is most
> effective when the cutoffs are derived from a separate hold-out data
> set. Just dumping all those samples in a training set seems like a
> lost opportunity.
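
The dissimilarity-based selection described in the third point above can be sketched in base R. This is only a greedy maximin illustration on simulated data, not the code used for the compound work; the function name `max_dissim_sample`, the Euclidean distance, and the choice of starting row are all assumptions made here.

```r
# Greedy maximum-dissimilarity sampling (a sketch, not the original method).
# At each step, add the row whose distance to its nearest already-chosen row
# is largest.
max_dissim_sample <- function(x, m, start = 1L) {
  x <- as.matrix(x)
  chosen <- integer(m)
  chosen[1] <- start
  # Distance from every row to the nearest chosen row so far.
  min_dist <- sqrt(rowSums(sweep(x, 2, x[start, ])^2))
  for (k in seq_len(m)[-1]) {
    nxt <- which.max(min_dist)
    chosen[k] <- nxt
    d_new <- sqrt(rowSums(sweep(x, 2, x[nxt, ])^2))
    min_dist <- pmin(min_dist, d_new)
  }
  chosen
}

# Simulated stand-in for a predictor matrix (hypothetical sizes).
set.seed(1)
x <- matrix(runif(500 * 5), nrow = 500, ncol = 5)
idx <- max_dissim_sample(x, m = 20)
train_subset <- x[idx, , drop = FALSE]
```

Each step costs one pass over the rows, so the whole selection is O(n * m) distance evaluations; for a million rows one would probably want to run it on a random subsample first.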
>
> Again, these are not refutations of your calculations. I just think
> that there are plenty of non-theoretical arguments for not using all
> of those values for the training set.
>
> Thanks,
>
> Max
> On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter
> <gunter.berton_at_gene.com> wrote:
> > Rich is right, of course. One way to think about it is this (paraphrased
> > from the section on the "Curse of Dimensionality" from Hastie et al's
> > "Statistical Learning" Book): suppose 10 uniformly distributed points on a
> > line give what you consider to be adequate coverage of the line. Then in 40
> > dimensions, you'd need 10^40 uniformly distributed points to give equivalent
> > coverage.
> >
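
The arithmetic behind that statement is short enough to check directly. The sketch below uses only the numbers mentioned in the thread (10 points per axis, 40 dimensions, a million observations); the variable names are illustrative.

```r
points_per_axis <- 10
d <- 40

# Points needed for the same per-axis coverage in d dimensions.
needed <- points_per_axis^d          # 10^40

# Conversely: the per-axis coverage that a million points buy in 40 dimensions.
n <- 1e6
equiv_per_axis <- n^(1 / d)          # 10^(6/40), roughly 1.41

c(needed = needed, equiv_per_axis = equiv_per_axis)
```

In other words, a million points in 40 dimensions correspond to fewer than two grid points per axis.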
> > Various other aspects of the curse of dimensionality are discussed in the
> > book, one of which is that in high dimensions, most points are closer to the
> > boundaries than to each other. As Rich indicates, this has profound
> > implications for what one can sensibly do with such data. One example is:
> > nearest neighbor procedures don't make much sense (as nobody is likely to
> > have anybody else nearby), which Rich's little simulation nicely
> > demonstrated.
> >
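
The "closer to the boundaries than to each other" remark can be illustrated with a small base-R simulation. This is a sketch on uniform random data in the unit hypercube; the sample size of 1000 is an arbitrary choice made to keep the distance matrix small.

```r
set.seed(1)
d <- 40     # dimensions, as in the thread
n <- 1000   # illustrative sample size (assumption)
x <- matrix(runif(n * d), nrow = n, ncol = d)

# Distance from each point to the nearest face of the unit hypercube.
to_boundary <- apply(pmin(x, 1 - x), 1, min)

# Distance from each point to its nearest neighbour among the other points.
dm <- as.matrix(dist(x))
diag(dm) <- Inf
to_neighbour <- apply(dm, 1, min)

# Share of points that are closer to the boundary than to any other point.
share_closer_to_boundary <- mean(to_boundary < to_neighbour)
share_closer_to_boundary
```

The distance to the boundary can never exceed 0.5, whereas neighbours in 40 dimensions are typically much further apart than that.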
> > Cheers to all,
> >
> > Bert Gunter
> > Genentech Nonclinical Statistics
> >
> >
> >
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Raubertas, Richard
> > Sent: Thursday, June 17, 2010 4:15 PM
> > To: Max Kuhn; Matthew OKane
> > Cc: r-help_at_r-project.org
> > Subject: Re: [R] Cforest and Random Forest memory use
> >
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces_at_r-project.org
> >> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Max Kuhn
> >> Sent: Monday, June 14, 2010 10:19 AM
> >> To: Matthew OKane
> >> Cc: r-help_at_r-project.org
> >> Subject: Re: [R] Cforest and Random Forest memory use
> >>
> >> The first thing that I would recommend is to avoid the "formula
> >> interface" to models. The internals that R uses to create matrices
> >> from a formula+data set are not efficient. If you had a large number
> >> of variables, I would have automatically pointed to that as a source
> >> of issues. cforest and ctree only have formula interfaces though, so
> >> you are stuck on that one. The randomForest package has both
> >> interfaces, so that might be better.
> >>
> >> Probably the issue is the depth of the trees. With that many
> >> observations, you are likely to get extremely deep trees. You might
> >> try limiting the depth of the tree and see if that has an effect on
> >> performance.
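
A rough sketch of both suggestions using the randomForest package: pass the predictors as a matrix rather than through a formula, and cap tree size with `nodesize` and `maxnodes`. The data are simulated stand-ins (only the name `badflag` comes from the thread), the tuning values are arbitrary, and the block assumes the randomForest package is installed.

```r
# Assumes the randomForest package is installed; data are simulated stand-ins.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  n <- 2000
  p <- 40
  x <- matrix(runif(n * p), nrow = n, ncol = p,
              dimnames = list(NULL, paste0("x", seq_len(p))))
  badflag <- factor(rbinom(n, 1, plogis(4 * (x[, 1] - 0.5))))

  fit <- randomForest::randomForest(
    x = x, y = badflag,   # matrix interface, no formula
    ntree = 50, mtry = 10,
    sampsize = 500,       # rows drawn for each tree
    nodesize = 25,        # minimum size of terminal nodes
    maxnodes = 64         # cap on terminal nodes per tree
  )

  # Size of the fitted object, to compare settings against each other.
  print(format(object.size(fit), units = "Mb"))
}
```

Larger `nodesize` and smaller `maxnodes` give shallower trees and a smaller fitted object; whether predictive performance suffers would have to be checked on held-out data.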
> >>
> >> We run into these issues with large compound libraries; in those cases
> >> we do whatever we can to avoid ensembles of trees or kernel methods.
> >> If you want those, you might need to write your own code that is
> >> hyper-efficient and tuned to your particular data structure (as we
> >> did).
> >>
> >> On another note... are this many observations really needed? You have
> >> 40ish variables; I suspect that >1M points are pretty densely packed
> >> into 40-dimensional space.
> >
> > This did not seem right to me: 40-dimensional space is very, very big
> > and even a million observations will be thinly spread. There is probably
> > some analytic result from the theory of coverage processes about this,
> > but I just did a quick simulation. If a million samples are independently
> > and randomly distributed in a 40-d unit hypercube, then >90% of the points
> > in the hypercube will be more than one-quarter of the maximum possible
> > distance (sqrt(40)) from the nearest sample. And about 40% of the hypercube
> > will be more than one-third of the maximum possible distance to the nearest
> > sample. So the samples do not densely cover the space at all.
> >
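
A scaled-down version of this kind of simulation might look like the sketch below. It is not the original code: the number of samples is cut from a million to 20,000 so that it runs quickly, and random query points stand in for "points in the hypercube". With fewer samples the nearest-sample distances will be larger than in the full-size run, so the fractions it prints are not directly comparable to the figures quoted above.

```r
set.seed(42)
d <- 40        # dimensions, as in the thread
n <- 20000     # samples; the thread used a million (scaled down here)
m <- 200       # random query points standing in for points in the hypercube

samples_t <- t(matrix(runif(n * d), nrow = n, ncol = d))  # d x n
queries <- matrix(runif(m * d), nrow = m, ncol = d)

# Euclidean distance from each query point to its nearest sample.
nearest <- apply(queries, 1, function(q) {
  sqrt(min(colSums((samples_t - q)^2)))
})

max_dist <- sqrt(d)   # longest possible distance in the unit hypercube
frac_quarter <- mean(nearest > max_dist / 4)
frac_third <- mean(nearest > max_dist / 3)
c(quarter = frac_quarter, third = frac_third)
```

Raising `n` towards a million moves the sketch closer to the setting described in the message, at the cost of run time and memory.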
> > One implication is that modeling the relation of a response to 40 predictors
> > will inevitably require a lot of smoothing, even with a million data points.
> >
> > Richard Raubertas
> > Merck & Co.
> >
> >> Do you lose much by sampling the data set
> >> or allocating a large portion to a test set? If you have thousands of
> >> predictors, I could see the need for so many observations, but I'm
> >> wondering if many of the samples are redundant.
> >>
> >> Max
> >>
> >> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
> >> <mlokane_at_gmail.com> wrote:
> >> > Answers added below.
> >> > Thanks again,
> >> > Matt
> >> >
> >> > On 11 June 2010 14:28, Max Kuhn <mxkuhn_at_gmail.com> wrote:
> >> >>
> >> >> Also, you have not said:
> >> >>
> >> >>  - your OS: Windows Server 2003 64-bit
> >> >>  - your version of R: 2.11.1 64-bit
> >> >>  - your version of party: 0.9-9995
> >> >
> >> >
> >> >>
> >> >>  - your code:
> >> >
> >> > test.cf <- cforest(formula = badflag ~ ., data = example,
> >> >     control = cforest_control(teststat = 'max', testtype = 'Teststatistic',
> >> >         replace = FALSE, ntree = 500, savesplitstats = FALSE, mtry = 10))
> >> >
> >> >>  - what "Large data set" means: > 1 million observations, 40+ variables,
> >> >> around 200MB
> >> >>  - what "very large model objects" means - anything which breaks
> >> >>
> >> >> So... how is anyone supposed to help you?
> >> >>
> >> >> Max
>
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 18 Jun 2010 - 19:52:59 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 18 Jun 2010 - 20:20:34 GMT.
