Re: [R] Half Million features Selection (Random Forest)

From: Prof Brian Ripley
Date: Sat 03 Jul 2004 - 15:58:15 EST

How many cases do you have? Since you apparently expect the dataset to be usable in R, you only have room to store a dataset with 200 cases or so (let alone space to analyse it).
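
The arithmetic behind that estimate can be sketched as follows. This is a rough illustration rather than a precise accounting: it assumes the design matrix is held densely, uses the feature count and RAM figure from the quoted message, and ignores R's own overhead and any copies made during analysis. The variable names are made up for the example.

```r
# Back-of-envelope memory estimate for a dense matrix in R.
n_features <- 5e5            # "about half a million" features (from the post)
ram_bytes  <- 512 * 2^20     # 512 MB of RAM (from the post)

# R stores doubles in 8 bytes and integers in 4 bytes per element.
bytes_per_case_double  <- n_features * 8
bytes_per_case_integer <- n_features * 4

# Number of cases whose storage alone would fill physical RAM.
cases_double  <- floor(ram_bytes / bytes_per_case_double)
cases_integer <- floor(ram_bytes / bytes_per_case_integer)

# Storage needed for 200 cases held as doubles, in MB.
mb_200_cases <- 200 * bytes_per_case_double / 2^20

cat("cases that fit (double): ", cases_double, "\n")
cat("cases that fit (integer):", cases_integer, "\n")
cat("MB for 200 cases (double):", round(mb_200_cases), "\n")
```

Under these assumptions the limit is on the order of one to a few hundred cases, which is consistent with the "200 cases or so" figure above.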

Even selecting *one* variable is statistically nonsensical with fewer than millions of cases (as otherwise the possibility of chance agreement of predictors is too high -- and I don't know enough about your problem to do even a rough calculation with any confidence).
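
The chance-agreement point can be illustrated with a small simulation. This is a sketch under stated assumptions, not a calculation for the poster's actual problem: the sizes `n` and `p`, the 50/50 binary predictors and the Gaussian response are all invented for the example, and the response is generated independently of every predictor.

```r
# Pure-noise illustration: the response is unrelated to every predictor,
# yet the best-looking predictor still appears strongly correlated with it.
set.seed(1)
n <- 50      # number of cases (hypothetical)
p <- 5000    # number of binary predictors (hypothetical, far below 500,000)

X <- matrix(rbinom(n * p, size = 1, prob = 0.5), nrow = n)
y <- rnorm(n)                       # independent of X by construction

# Constant columns have zero variance and give NA correlations (with a
# warning); they are extremely unlikely at these settings.
r <- as.vector(cor(X, y))           # correlation of each column with y
max_abs_r <- max(abs(r), na.rm = TRUE)

cat("largest |correlation| among", p, "noise predictors:",
    round(max_abs_r, 2), "\n")
```

With settings like these the largest absolute correlation should come out at roughly 0.5 even though no predictor carries any signal, and it grows as the number of predictors increases relative to the number of cases.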

On Fri, 2 Jul 2004, daisy wrote:

> I have about half a million binary features, and would like to find a
> model to estimate the continuous response. According to the inference, I
> can express predictors and response by a linear model. (i.e. Design matrix:
> large sparse matrix with 0/1. Response: continuous number.) Since it is
> not a classification problem, someone suggested that I try random forest
> in R. However, the randomForest help page points out: "For large
> data sets, especially those with large number of variables, calling
> 'randomForest' via the formula interface is not advised: There may be
> too much overhead in handling the formula." I also tried it with 300
> variables, and R either gave me an error message or did not respond. (OS:
> Windows XP; R: 1.9.0; RAM: 512 MB) Is there any way to run random
> forest on a dataset this big? Any suggestion is welcome! Many thanks!

Brian D. Ripley,        
Professor of Applied Statistics,
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Sat Jul 03 16:01:26 2004
