# Re: [R] missing data imputation

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Sun 10 Jul 2005 - 00:03:35 EST

On 08-Jul-05 Anders Schwartz Corr wrote:
>
> Dear R-help,
>
> I am trying to impute missing data for the first time using R.
> The norm package seems to work for me, but the missing values
> that it returns seem odd at times -- for example it returns
> negative values for a variable that should only be positive.
> Does this matter in data analysis, and/or is there a way to
> limit the imputed values to be within the minimum and
> maximum of the actual data? Below is the code I am using.

If you have a variable that should only be positive, then strictly speaking you should not treat it as normally distributed, since a normal distribution -- however large the mean, however small the variance -- theoretically has positive probability of giving negative values. So what you have observed in your data is within the job-description of the normal distribution.
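To see just how small (but never zero) this probability can be, here is a one-line check in R; the mean and SD are arbitrary illustrative values:

```r
# P(X < 0) for a normal variable whose mean sits 5 SDs above zero
# (illustrative values, not from any real dataset):
pnorm(0, mean = 5, sd = 1)
#> [1] 2.866516e-07   # tiny, but never exactly zero
```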

In practice, whether this matters in data analysis depends on the range of values in a typical dataset, on the mean and SD of a typical fitted normal distribution, on the probability that such a distribution will give a negative value, and on the sample size. (Even if P(<0) is only 10^(-4), with sample sizes of 10^6 you are very likely to get some negative values.)
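For concreteness, a back-of-the-envelope version of that calculation in R; the mean and SD here are chosen only so that P(<0) comes out near 10^(-4):

```r
p_neg <- pnorm(0, mean = 3.7, sd = 1)   # about 1.1e-04 for these values
n     <- 1e6                            # sample size

n * p_neg           # expected number of negative values: about 108
1 - (1 - p_neg)^n   # P(at least one negative value): effectively 1
```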

Whether it matters also depends, of course, on the practical consequences: what, in the real world, will break if there's a negative value or two in there?

It is also the case that imputed values generated by a procedure such as NORM have greater dispersion than the variable itself. This is a consequence of the way such imputation works: each imputation is drawn from a *random* instance of a normal distribution, whose mean and variance are themselves sampled from the Bayesian posterior distribution of these parameters given the complete data and the covariates of the incomplete data. So an imputed value is more likely to be negative than an observed value is.
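To make this concrete, here is a minimal sketch of how such draws are made with the norm package; `dat` is a placeholder for your own numeric data matrix with NAs, and the number of imputations and data-augmentation steps are arbitrary:

```r
library(norm)

rngseed(20050710)              # norm uses its own RNG; seeding is required
s        <- prelim.norm(dat)   # preliminary sort by missingness pattern
thetahat <- em.norm(s)         # ML estimate of mean vector and covariance

# Five imputations, each under a fresh posterior draw of (mu, Sigma):
imps <- lapply(1:5, function(i) {
  theta <- da.norm(s, thetahat, steps = 100)  # data augmentation draw
  imp.norm(s, theta, dat)                     # fill in NAs under this theta
})

# How many imputed values came out negative in each completed dataset?
sapply(imps, function(x) sum(x[is.na(dat)] < 0))
```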

It is also worth looking at the shape of the histogram of such a variable. In many applications (though not all), it may exhibit positive skewness, which would suggest that a log-normal distribution would be a better fit in any case. In that case, take the logarithm of the data, which will (to within the adequacy of fit) have a normal distribution. Run your imputations, then take the exponential of the results, transforming back to the scale of the original variable. The result is necessarily positive, so "anomalous" negative values simply cannot occur.
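In code, the log-scale approach might look like the following sketch; the column name "y" is hypothetical, and all observed values of that variable are assumed to be strictly positive:

```r
library(norm)

# 'dat' is a placeholder numeric matrix; "y" is the positive variable.
hist(log(dat[, "y"]))            # does the log scale look roughly normal?

dat_log <- dat
dat_log[, "y"] <- log(dat[, "y"])   # NAs pass through log() unchanged

rngseed(20050710)
s     <- prelim.norm(dat_log)
theta <- da.norm(s, em.norm(s), steps = 100)
imp   <- imp.norm(s, theta, dat_log)

imp[, "y"] <- exp(imp[, "y"])    # back-transform: necessarily positive
```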

Also, remember that a variable to which you may quite reasonably have attributed a normal distribution (because of good fit to the data) may be intrinsically positive solely for *semantic* reasons. E.g. it may be a measured length. God made all lengths positive, and you and we know this. But R, and NORM, and rnorm(), and all their friends, do not know this. Of semantics they know nothing. And the Daemon of Randomness will see a normal distribution, and mischievously spit negative values at you, simply because they are there ...

This is just general advice, but it may give you something to think about.

I've also noted Frank Harrell's comment about aregImpute, and will bear it in mind. Note, however, that this does not do multiple imputation along the same lines as NORM (or the other Schafer-derived MI packages): see ?aregImpute, sections "Description" and "Details". In particular, the Rubin/Schafer method described above (see the paragraph about the dispersion of imputed values) is not fully implemented.
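For reference, basic usage of aregImpute looks something like the sketch below; `d`, `y`, `x1` and `x2` are placeholders, not anything from your data:

```r
library(Hmisc)

# 'd' is a hypothetical data frame with NAs in y and x2:
a <- aregImpute(~ y + x1 + x2, data = d, n.impute = 5)
a$imputed$y    # the 5 sets of values imputed for y
```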
