Re: [R] randomForest and missing data

From: Darin A. England <england_at_cs.umn.edu>
Date: Thu 04 Jan 2007 - 22:07:10 GMT

Yes, I completely agree with your statements. As for a way around it, CART has some facilities for dealing with missing data. For example, when an observation is dropped into the tree and encounters a split at which the variable is missing, one option is simply not to send it any further down the tree. One then obtains a prediction from that interior node. It is probably not a very good prediction, but it is one way to handle cases with missing values. So my question is: why can't we have that capability in randomForest as well?
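To make the idea concrete, here is a minimal sketch of "stop at the interior node when the split variable is missing". It is written in Python rather than R only to keep it self-contained, and it is not how randomForest or rpart is implemented. The Node class, its field names, the use of None to mark a missing value, and the toy tree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    prediction: float               # fitted value of the training cases at this node
    feature: Optional[str] = None   # split variable; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # branch taken when value <= threshold
    right: Optional["Node"] = None  # branch taken when value > threshold


def predict_stop_at_missing(node, row):
    """Descend until a leaf, or until the split variable is missing (None)."""
    while node.feature is not None:
        value = row.get(node.feature)
        if value is None:
            break  # do not send the case further; use this interior node
        node = node.left if value <= node.threshold else node.right
    return node.prediction


def forest_predict(trees, row):
    """Average the per-tree predictions (regression-style aggregation)."""
    return sum(predict_stop_at_missing(t, row) for t in trees) / len(trees)


# Hypothetical toy tree: split on age, then on income in the right branch.
tree = Node(
    prediction=5.0, feature="age", threshold=40.0,
    left=Node(prediction=2.0),
    right=Node(prediction=8.0, feature="income", threshold=50.0,
               left=Node(prediction=6.0), right=Node(prediction=10.0)),
)

full = predict_stop_at_missing(tree, {"age": 50, "income": 70})       # reaches a leaf
partial = predict_stop_at_missing(tree, {"age": 50, "income": None})  # stops at the income split
root_only = predict_stop_at_missing(tree, {"age": None})              # stops at the root
```

Every tree in a forest could apply the same rule independently, so the averaging step would not need to change. The cost is that cases stopped near the root contribute coarse predictions.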

Darin

On Thu, Jan 04, 2007 at 03:44:27PM -0600, Sicotte, Hugues Ph.D. wrote:
> I don't know about this module, but a general answer is that if you have
> missing data, it may affect your model. If your data is missing at
> random, then you might be lucky in your model building.
>
> If, however, your data were not missing at random (e.g. censoring), you
> might build a wrong predictor.
>
> Missing at random or not, that is a question you should answer and deal
> with before modeling.
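A small illustration of why the mechanism matters, sketched in Python with made-up numbers (the data, the cutoff of 80, and the seed are all assumptions). It compares the error in a simple mean when 20 of 100 values are dropped completely at random against the error when the 20 largest values are censored.

```python
import random
from statistics import mean

random.seed(0)
values = list(range(1, 101))  # hypothetical complete data

# Missing completely at random: 20 values dropped regardless of their size.
mcar_observed = random.sample(values, 80)

# Not at random: every value above 80 is censored, i.e. never recorded.
censored_observed = [v for v in values if v <= 80]

true_mean = mean(values)
mcar_error = abs(mean(mcar_observed) - true_mean)
censored_error = abs(mean(censored_observed) - true_mean)

print(f"error with random missingness: {mcar_error:.2f}")
print(f"error with censoring:          {censored_error:.2f}")
```

The randomly thinned sample only adds noise around the true mean, whereas the censored sample is shifted systematically. A model fitted to the censored cases would inherit that shift.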
>
> I refer you to a book like
> "Analysis of Incomplete Multivariate Data" by Schafer.
>
> If there is a way around that with randomForest, I'd be interested to
> know too.
>
> Hugues Sicotte
>
>
> -----Original Message-----
> From: r-help-bounces@stat.math.ethz.ch
> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Darin A. England
> Sent: Thursday, January 04, 2007 3:13 PM
> To: r-help@stat.math.ethz.ch
> Subject: [R] randomForest and missing data
>
>
> Does anyone know a reason why, in principle, a call to randomForest
> cannot accept a data frame with missing predictor values? If each
> individual tree is built using CART, then it seems like this
> should be possible. (I understand that one may impute missing values
> using rfImpute or some other method, but I would like to avoid doing
> that.)
>
> If this functionality were available, then when the trees are being
> constructed and when subsequent data are put through the forest, one
> would also specify an argument for the use of surrogate rules, just
> like in rpart.
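For reference, the surrogate idea can be sketched as follows. This is a Python illustration under stated assumptions, not rpart's code: the Split and Node classes, the ordering of surrogates, and the fallback to the majority direction when every candidate variable is missing are all choices made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Split:
    feature: str
    threshold: float
    low_goes_left: bool = True  # a surrogate may point the opposite way

    def goes_left(self, row):
        """Return True/False for left/right, or None if the variable is missing."""
        value = row.get(self.feature)
        if value is None:
            return None
        return (value <= self.threshold) == self.low_goes_left


@dataclass
class Node:
    prediction: float
    primary: Optional[Split] = None                 # None marks a leaf
    surrogates: list = field(default_factory=list)  # ordered, best agreement first
    majority_left: bool = True                      # where most training cases went
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def route_left(node, row):
    """Try the primary split, then each surrogate, then the majority direction."""
    for split in [node.primary, *node.surrogates]:
        decision = split.goes_left(row)
        if decision is not None:
            return decision
    return node.majority_left


def predict(node, row):
    while node.primary is not None:
        node = node.left if route_left(node, row) else node.right
    return node.prediction


# Hypothetical node: split on age, with years_employed as the only surrogate.
tree = Node(
    prediction=5.0,
    primary=Split("age", 40.0),
    surrogates=[Split("years_employed", 15.0)],
    majority_left=False,
    left=Node(prediction=2.0),
    right=Node(prediction=8.0),
)

via_primary = predict(tree, {"age": 30})                            # age decides
via_surrogate = predict(tree, {"age": None, "years_employed": 20})  # surrogate decides
via_majority = predict(tree, {})                                    # nothing observed
```

In a forest, each node would need its surrogates chosen from the variables available at that node, which is presumably where the extra cost of building the trees would come from.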
>
> I realize this question is very specific to randomForest, as opposed
> to R in general, but any comments are appreciated. I suppose I am
> looking for someone to say "It's not appropriate, and here's why
> ..." or "Good idea. Please implement and post your code."
>
> Thanks,
>
> Darin England, Senior Scientist
> Ingenix
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Fri Jan 05 09:11:58 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 04 Jan 2007 - 22:30:24 GMT.
