From: Raubertas, Richard <richard_raubertas_at_merck.com>

Date: Fri, 18 Jun 2010 15:48:32 -0400

Max,

My disagreement was really just about the single statement 'I suspect
that >1M points are pretty densely packed into 40-dimensional space' in
your original post. On the larger issue of diminishing returns with
the size of a training set, I agree with your points below.

Rich

> -----Original Message-----
> From: Max Kuhn [mailto:mxkuhn_at_gmail.com]
> Sent: Friday, June 18, 2010 1:35 PM
> To: Bert Gunter
> Cc: Raubertas, Richard; Matthew OKane; r-help_at_r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> Rich's calculations are correct, but from a practical standpoint I
> think that using all the data for the model is overkill for a few
> reasons:
>
> - the calculations that you show implicitly assume that the predictor
> values can be reliably differentiated from each other. Unless they are
> deterministic calculations (e.g. number of hydrogen bonds, % GC in a
> sequence), they are subject to measurement error. We don't know anything
> about the context here, but in the lab sciences, the measurement variation
> can make the *effective* number of predictor values much less than n. So
> you can have millions of predictor values but you might only be able
> to differentiate k <<<< n values reliably.
>
> - the important dimensionality to consider is based on how many of
> those 40 are relevant to the outcome. Again, we don't know the context
> of the data, but there is a strong prior towards the number of
> important variables being less than 40.
>
> - We've had to consider these types of problems a lot. We might have
> 200K samples (compounds in this case) and 1000 predictors that appear
> to matter. Ensembles of trees tended to do very well, as did kernel
> methods. In either of those two classes of models, the prediction time
> for a single new observation is very long. So we looked at how
> performance was affected if we were to reduce the training set size.
> In essence, we found that <50% of the data could be used with no
> appreciable effect on performance. We could make the percentage
> smaller if we used the predictor values to sample the data set for
> prediction; if we had m samples in the training set, the next sample
> added would have to have maximum dissimilarity to the existing m
> samples.
>
> - If you are going to do any feature selection, you would be better
> off segregating a percentage of those million samples as a hold-out
> set to validate the selection process (a few people from Merck have
> written excellent papers on the selection bias problem). Similarly, if
> this is a classification problem, any ROC curve analysis is most
> effective when the cutoffs are derived from a separate hold-out data
> set. Just dumping all those samples in a training set seems like a
> lost opportunity.
>
> Again, these are not refutations of your calculations. I just think
> that there are plenty of non-theoretical arguments for not using all
> of those values for the training set.
>
> Thanks,
>
> Max
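
A minimal base-R sketch of the two ideas in the message above: reserving a
hold-out set and shrinking the training set by greedy maximum-dissimilarity
sampling. Everything here (dat, y, the sizes) is a simulated placeholder
invented for illustration; it is not the posters' code.

    ## Sketch only: dat, y, and all sizes below are simulated placeholders.
    set.seed(42)
    dat <- as.data.frame(matrix(rnorm(2000 * 40), ncol = 40))   # stand-in predictors
    y   <- factor(rbinom(2000, 1, 0.2))                         # stand-in outcome

    ## (a) reserve a hold-out set for validating feature selection / ROC cutoffs
    holdout_idx <- sample(nrow(dat), size = round(0.3 * nrow(dat)))
    train_x <- dat[-holdout_idx, ]; train_y <- y[-holdout_idx]
    hold_x  <- dat[ holdout_idx, ]; hold_y  <- y[ holdout_idx]   # kept aside for validation

    ## (b) greedy maximum-dissimilarity selection: start from one point and keep
    ## adding the candidate that is farthest from everything already selected
    max_dissim_subset <- function(x, n_keep) {
      x <- as.matrix(x)
      selected <- sample(nrow(x), 1)
      min_dist <- sqrt(rowSums(sweep(x, 2, x[selected, ])^2))   # distance to current selection
      while (length(selected) < n_keep) {
        nxt <- which.max(min_dist)          # most dissimilar remaining point
        selected <- c(selected, nxt)
        min_dist <- pmin(min_dist, sqrt(rowSums(sweep(x, 2, x[nxt, ])^2)))
      }
      selected
    }

    keep    <- max_dissim_subset(train_x, n_keep = 500)
    small_x <- train_x[keep, ]
    small_y <- train_y[keep]

If you would rather not roll your own, the caret package has a maxDissim()
function for this kind of dissimilarity-based sampling.
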
> On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter
> <gunter.berton_at_gene.com> wrote:
> > Rich is right, of course. One way to think about it is this (paraphrased
> > from the section on the "Curse of Dimensionality" in Hastie et al.'s
> > "Statistical Learning" book): suppose 10 uniformly distributed points on a
> > line give what you consider to be adequate coverage of the line. Then in 40
> > dimensions, you'd need 10^40 uniformly distributed points to give
> > equivalent coverage.
> >
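
As a quick numeric companion to this point, the same chapter gives the edge
length of a hypercube neighborhood capturing a fraction r of uniformly
distributed data in p dimensions as r^(1/p). A two-line illustration (not
something posted in the thread):

    ## edge length of a cube neighborhood containing a fraction r of points
    ## uniformly distributed in the p-dimensional unit hypercube: e = r^(1/p)
    edge_needed <- function(r, p) r^(1 / p)
    round(edge_needed(0.01, p = c(1, 2, 10, 40)), 3)
    ## -> 0.010 0.100 0.631 0.891: in 40-d, a "local" neighborhood must span
    ##    about 89% of each coordinate's range just to capture 1% of the data
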
> > Various other aspects of the curse of dimensionality are discussed in the
> > book, one of which is that in high dimensions, most points are closer to
> > the boundaries than to each other. As Rich indicates, this has profound
> > implications for what one can sensibly do with such data. One example is:
> > nearest neighbor procedures don't make much sense (as nobody is likely to
> > have anybody else nearby), which Rich's little simulation nicely
> > demonstrated.
> >
> > Cheers to all,
> >
> > Bert Gunter
> > Genentech Nonclinical Statistics
> >
> >
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org
> > [mailto:r-help-bounces_at_r-project.org] On Behalf Of Raubertas, Richard
> > Sent: Thursday, June 17, 2010 4:15 PM
> > To: Max Kuhn; Matthew OKane
> > Cc: r-help_at_r-project.org
> > Subject: Re: [R] Cforest and Random Forest memory use
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces_at_r-project.org
> >> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Max Kuhn
> >> Sent: Monday, June 14, 2010 10:19 AM
> >> To: Matthew OKane
> >> Cc: r-help_at_r-project.org
> >> Subject: Re: [R] Cforest and Random Forest memory use
> >>
> >> The first thing that I would recommend is to avoid the "formula
> >> interface" to models. The internals that R uses to create matrices
> >> from a formula + data set are not efficient. If you had a large number
> >> of variables, I would have automatically pointed to that as a source
> >> of issues. cforest and ctree only have formula interfaces though, so
> >> you are stuck on that one. The randomForest package has both
> >> interfaces, so that might be better.
> >>
> >> Probably the issue is the depth of the trees. With that many
> >> observations, you are likely to get extremely deep trees. You might
> >> try limiting the depth of the trees and see if that has an effect on
> >> performance.
> >>
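
To make the two suggestions above concrete, a rough randomForest sketch using
the matrix (x, y) interface and the tree-size controls the package exposes.
The objects pred_mat and badflag are simulated stand-ins, and the nodesize and
maxnodes values are arbitrary examples of the knobs, not tuned recommendations.

    ## Sketch only: pred_mat and badflag are simulated stand-ins, not the real data.
    library(randomForest)

    set.seed(1)
    pred_mat <- matrix(rnorm(5000 * 40), ncol = 40)   # stand-in for the 40 predictors
    badflag  <- factor(rbinom(5000, 1, 0.1))          # stand-in binary outcome

    ## The (x, y) interface skips the model.frame/model.matrix machinery used by
    ## the formula interface; nodesize/maxnodes keep the trees from growing
    ## extremely deep on a very large training set.
    rf_fit <- randomForest(x = pred_mat, y = badflag,
                           ntree = 500, mtry = 10,
                           nodesize = 50,     # larger terminal nodes => shallower trees
                           maxnodes = 1024)   # hard cap on terminal nodes per tree
    print(rf_fit)
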
> >> We run into these issues with large compound libraries; in those cases
> >> we do whatever we can to avoid ensembles of trees or kernel methods.
> >> If you want those, you might need to write your own code that is
> >> hyper-efficient and tuned to your particular data structure (as we
> >> did).
> >>
> >> On another note... are this many observations really needed? You have
> >> 40ish variables; I suspect that >1M points are pretty densely packed
> >> into 40-dimensional space.
> >
> > This did not seem right to me: 40-dimensional space is very, very big
> > and even a million observations will be thinly spread. There is probably
> > some analytic result from the theory of coverage processes about this,
> > but I just did a quick simulation. If a million samples are independently
> > and randomly distributed in a 40-d unit hypercube, then >90% of the points
> > in the hypercube will be more than one-quarter of the maximum possible
> > distance (sqrt(40)) from the nearest sample. And about 40% of the hypercube
> > will be more than one-third of the maximum possible distance to the nearest
> > sample. So the samples do not densely cover the space at all.
> >
> > One implication is that modeling the relation of a response to 40 predictors
> > will inevitably require a lot of smoothing, even with a million data points.
> >
> > Richard Raubertas
> > Merck & Co.
> >
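
A scaled-down sketch of the kind of coverage simulation described above (not
Rich's actual code): 1e5 rather than 1e6 samples to keep memory and run time
modest. The exact percentages will differ from his, but the sparse-coverage
conclusion is the same.

    ## Sketch: scaled-down version of the hypercube coverage simulation.
    set.seed(1)
    d      <- 40
    n      <- 1e5     # scaled down from the 1e6 discussed above
    n_test <- 200     # random probe points at which to measure coverage

    samples <- matrix(runif(n * d), nrow = n)
    test    <- matrix(runif(n_test * d), nrow = n_test)

    ## nearest-neighbour distance from each probe point to the sample cloud, using
    ## ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y to avoid a huge distance matrix
    ss <- rowSums(samples^2)
    nearest <- sapply(seq_len(n_test), function(i) {
      d2 <- ss + sum(test[i, ]^2) - 2 * drop(samples %*% test[i, ])
      sqrt(max(min(d2), 0))
    })

    max_dist <- sqrt(d)                            # longest possible distance in the unit hypercube
    mean(nearest > max_dist / 4)                   # fraction of probe points "far" from every sample
    quantile(nearest / max_dist, c(0.1, 0.5, 0.9))
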
> >> Do you lose much by sampling the data set
> >> or allocating a large portion to a test set? If you have thousands of
> >> predictors, I could see the need for so many observations, but I'm
> >> wondering if many of the samples are redundant.
> >>
> >> Max
> >>
> >> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
> >> <mlokane_at_gmail.com> wrote:
> >> > Answers added below.
> >> > Thanks again,
> >> > Matt
> >> >
> >> > On 11 June 2010 14:28, Max Kuhn <mxkuhn_at_gmail.com> wrote:
> >> >>
> >> >> Also, you have not said:
> >> >>
> >> >> - your OS: Windows Server 2003 64-bit
> >> >> - your version of R: 2.11.1 64-bit
> >> >> - your version of party: 0.9-9995
> >> >>
> >> >> - your code:
> >> >   test.cf <- cforest(formula = badflag ~ ., data = example,
> >> >     control = cforest_control(teststat = "max", testtype = "Teststatistic",
> >> >       replace = FALSE, ntree = 500, savesplitstats = FALSE, mtry = 10))
> >> >
> >> >
> >> >> - what "Large data set" means: > 1 million observations, 40+ variables,
> >> >> around 200MB
> >> >> - what "very large model objects" means - anything which breaks
> >> >>
> >> >> So... how is anyone supposed to help you?
> >> >>
> >> >> Max
> >


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Fri 18 Jun 2010 - 19:52:59 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at
the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Fri 18 Jun 2010 - 20:20:34 GMT.
