Re: [R] memory problems when combining randomForests

From: Ramon Diaz-Uriarte <rdiaz_at_cnio.es>
Date: Tue 01 Aug 2006 - 18:32:18 EST

Dear Eleni,

>
> But if every time you remove a variable you pass some test data (ie data
> not used to train the model) and base the performance of the new, reduced
> model on the error rate on the confusion matrix for the test data, then
> this "overfitting" should not be an issue, right? (unless of course you
> were referring to unsupervised learning).
>

Yes and no. A problem can still arise if you do this iteratively and then report the minimum test error obtained along the way as your estimate of the error rate: that minimum was itself chosen using the test data, so it is biased downward. In such a case you should, instead, do a double cross-validation or bootstrap (i.e., estimate, via cross-validation or the bootstrap, the error rate of your complete procedure, variable selection included).
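The point about estimating the error rate of the complete procedure can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not the method of either paper cited in this message: it is written in Python rather than R, it replaces the random forest with a hypothetical toy rule ("predict the value of one binary feature"), and all function names are made up for illustration. The labels are pure noise, so any honest estimate of the error rate should be close to 0.5. The outer loop plays the role of the outer layer of a double cross-validation; the selection step inside it is a plain minimum search rather than a second cross-validation.

```python
import random


def make_noise_data(n, p, rng):
    """Return n samples with p binary features and labels unrelated to them."""
    X = [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y


def error_rate(X, y, rows, feature, flip):
    """Error on `rows` of the toy rule: predict x[feature], inverted if flip == 1."""
    wrong = sum((X[i][feature] ^ flip) != y[i] for i in rows)
    return wrong / len(rows)


def select_rule(X, y, rows):
    """Selection step (hypothetical): the rule with the lowest error on `rows`."""
    candidates = [(j, flip) for j in range(len(X[0])) for flip in (0, 1)]
    return min(candidates, key=lambda r: error_rate(X, y, rows, r[0], r[1]))


def naive_estimate(X, y):
    """Select on the data and report the minimum error found on the same data."""
    rows = list(range(len(y)))
    feature, flip = select_rule(X, y, rows)
    return error_rate(X, y, rows, feature, flip)


def outer_cv_estimate(X, y, k=5):
    """Cross-validate the complete procedure: selection is redone in every fold."""
    n = len(y)
    wrong = 0
    for fold in range(k):
        held_out = list(range(fold, n, k))
        training = [i for i in range(n) if i % k != fold]
        feature, flip = select_rule(X, y, training)  # never sees held_out
        wrong += sum((X[i][feature] ^ flip) != y[i] for i in held_out)
    return wrong / n


def run_simulation(reps=20, n=100, p=50, seed=1):
    """Average both estimates over `reps` pure-noise data sets."""
    rng = random.Random(seed)
    naive_total = outer_total = 0.0
    for _ in range(reps):
        X, y = make_noise_data(n, p, rng)
        naive_total += naive_estimate(X, y)
        outer_total += outer_cv_estimate(X, y)
    return naive_total / reps, outer_total / reps


if __name__ == "__main__":
    naive, outer = run_simulation()
    print(f"minimum error reported naively:  {naive:.3f}")
    print(f"error of the complete procedure: {outer:.3f}")
```

On data like this the naively reported minimum should come out well below 0.5 even though no rule can do better than chance, while the estimate for the complete procedure should stay near 0.5. The same logic applies when the selection step is a sequence of random forests with variables removed one at a time.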

Both Andy and collaborators on the one hand and myself on the other have done some further work on these issues.

Svetnik V, Liaw A, Tong C, Wang T: Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, 9–11 June 2004, Cagliari, Italy. Lecture Notes in Computer Science, Springer 2004, 3077:334-343.

Díaz-Uriarte R, Alvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7:3. http://www.biomedcentral.com/1471-2105/7/3

Best,

R.

On Monday 31 July 2006 18:45, Eleni Rapsomaniki wrote:
> Hi Andy,
>
> > > I get different order of importance for my variables depending on their
> > > order in the training data.
>
> Perhaps answering my own question, the change in importance rankings could
> be attributed to the fact that before passing my data to randomForest I
> impute the missing values randomly (using the combined distributions of
> pos+neg), so the data seen by RF is slightly different. Then combining this
> with the fact that RF chooses data randomly it makes sense to see different
> rankings.
>
> In a previous thread regarding simplifying variables:
> http://thread.gmane.org/gmane.comp.lang.r.general/6989/focus=6993
>
> you say:
> "The basic problem is that when you select important variables by RF and
> then re-run RF with those variables, the OOB error rate become biased
> downward. As you iterate more times, the "overfitting" becomes more and
> more severe (in the sense that, the OOB error rate will keep decreasing
> while error rate on an independent test set will be flat or increases)"
>
> But if every time you remove a variable you pass some test data (ie data
> not used to train the model) and base the performance of the new, reduced
> model on the error rate on the confusion matrix for the test data, then
> this "overfitting" should not be an issue, right? (unless of course you
> were referring to unsupervised learning).
>
> Best regards
> Eleni Rapsomaniki
> Birkbeck College, UK
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html and provide commented, minimal,
> self-contained, reproducible code.

-- 
Ramón Díaz-Uriarte
Bioinformatics 
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +-34-91-224-6972
Phone: +-34-91-224-6900

http://ligarto.org/rdiaz
PGP KeyID: 0xE89B3462
(http://ligarto.org/rdiaz/0xE89B3462.asc)







Received on Tue Aug 01 18:38:41 2006
