Re: [R] missing data imputation

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Sun 10 Jul 2005 - 01:49:08 EST

On 09-Jul-05 Ted Harding wrote:
> On 08-Jul-05 Anders Schwartz Corr wrote:

>> [...]

> [...]
> Meanwhile, I will try to have a look at the dataset whose URL
> you give, and see if I have any more specific comments.

Now that I look at the histograms of your 21 variables, I would not think of treating most of them as anything like normally distributed (for a few, a normal distribution might roughly reflect the underlying distribution, though it would only fit where it touches).

Nor is it obvious what kind of distribution to think of trying for many of them. Perhaps you have ideas, from your knowledge of the field the data were drawn from, of what kind of model to use. But not many types of explicit model are implemented in MI software anywhere, let alone in R.

These considerations rule out trying NORM or anything similar, since such approaches depend strongly on a reasonably good model for the distribution of the data.

In any case, it looks as though some of them are categorical, with 2 or 3 levels, and NORM is rarely good for such variables. When some variables are discrete and others are continuous (and can be assumed to be, or transformed to be, normally distributed), the 'mix' package is the one to consider. But, for the reasons above, I wouldn't go in that direction anyway.

> I've also noted Frank Harrell's comment about aregImpute, and
> will bear it in mind.
> [...]

The above comments point towards an approach which is much less dependent on model assumptions.

The most model-free approach is in the family of "hot deck" approaches where the imputed values of a variable are randomly sampled from the observed values of this variable, attempting to match the observed covariates of the group sampled from with the observed covariates of the value to be imputed.
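
To make the hot-deck idea concrete, here is a minimal sketch, in Python rather than R and using only the standard library. It is not what aregImpute or any other package does; the function name hot_deck_impute, the toy variables (sex, smoker, score) and the fallback rule are all made up for illustration. It assumes the matching covariates are categorical and fully observed, matches donors by exact agreement on them, and falls back to sampling from all observed values when no donor matches.

```python
import random

def hot_deck_impute(rows, target, match_on, seed=0):
    """Fill missing (None) values of `target` by sampling from donor rows.

    Hypothetical helper for illustration only: donors are matched by exact
    agreement on the covariates named in `match_on`.
    """
    rng = random.Random(seed)
    # Donors are the rows where the target variable was actually observed.
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no observed values to sample from")
    completed = []
    for r in rows:
        r = dict(r)  # copy, so the input rows are left untouched
        if r[target] is None:
            # Prefer donors that agree with this row on every covariate.
            pool = [d for d in donors if all(d[k] == r[k] for k in match_on)]
            if not pool:
                pool = donors  # assumed fallback: sample from all observed values
            r[target] = rng.choice(pool)[target]
        completed.append(r)
    return completed

# Toy data (invented): two rows have a missing score.
rows = [
    {"sex": "F", "smoker": "no", "score": 3},
    {"sex": "F", "smoker": "no", "score": 5},
    {"sex": "M", "smoker": "yes", "score": 9},
    {"sex": "F", "smoker": "no", "score": None},
    {"sex": "M", "smoker": "yes", "score": None},
]
completed = hot_deck_impute(rows, target="score", match_on=["sex", "smoker"])
```

Each imputed value is, by construction, one that was actually observed in a matching group, so no distributional form is assumed. Running the function with several different seeds would give several completed datasets, which is the starting point for multiple imputation; with continuous covariates, exact matching would have to be replaced by some nearest-neighbour rule.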

I've not used aregImpute, but from reading ?aregImpute it does seem that there is an underlying "hot deck" mechanism, so it may suit your purpose well. However, from the "Description" and "Details" of aregImpute, it seems that there is also an element of quasi-modelling involved, albeit on a basically non-parametric basis.

The person to comment on this would be Frank Harrell himself!

Best wishes,
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 09-Jul-05                                       Time: 16:49:04
------------------------------ XFMail ------------------------------
