Re: [R] missing data imputation

From: Ted Harding
Date: Sun 10 Jul 2005 - 00:03:35 EST

On 08-Jul-05 Anders Schwartz Corr wrote:
> Dear R-help,
> I am trying to impute missing data for the first time using R.
> The norm package seems to work for me, but the missing values
> that it returns seem odd at times -- for example it returns
> negative values for a variable that should only be positive.
> Does this matter in data analysis, and/or is there a way to
> limit the imputed values to be within the minimum and
> maximum of the actual data? Below is the code I am using.

If you have a variable that should only be positive, then strictly speaking you should not treat it as normally distributed, since a normal distribution -- however large the mean, however small the variance -- theoretically has positive probability of giving negative values. So what you have observed in your data is within the job-description of the normal distribution.

In practice, whether this matters in data analysis depends on the range of values in a typical dataset, on the mean and SD of a typical fitted normal distribution, on the probability that such a distribution will give a negative value, and on the sample size. (Even if P(<0) is only 10^(-4), if you are dealing with sample sizes of 10^6 you are very likely to get some negative values.)
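As a rough sketch of that arithmetic: the mean of 10 and SD of 2.7 below are made-up values (chosen only so that P(<0) comes out near 10^(-4)), and p_any_negative is a helper name introduced here for illustration.

```r
# Illustrative fitted normal; these parameters are assumptions, not from any real data
mu <- 10
sigma <- 2.7
p_neg <- pnorm(0, mean = mu, sd = sigma)   # P(X < 0) for a single draw

# Probability of at least one negative value among n independent draws,
# i.e. 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p
p_any_negative <- function(p, n) -expm1(n * log1p(-p))

p_neg                         # on the order of 10^(-4)
p_any_negative(1e-4, 1e6)     # sample size 10^6: practically certain
p_any_negative(1e-4, 100)     # sample size 100: roughly 1 in 100
1e-4 * 1e6                    # expected number of negatives when n = 10^6
```

So the same tail probability that is negligible for a hundred cases is a near certainty for a million.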

Whether it matters in practice also depends, of course, on whether it matters in practice. What, in the real world, will break if there's a negative value or two in there?

In many cases people simply treat negative estimates of variables which are intrinsically non-negative very crudely: if it comes out negative, replace it with zero. This too is often a quick fix where the fact that it is a lie simply has no practical importance. But, of course, it may matter! That depends ... (see above).
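In R the crude fix is a one-liner. The vector below is invented purely to show the mechanics; it is worth counting how many values get altered before deciding the lie is harmless.

```r
# Made-up imputed values for a variable that should be non-negative
imputed <- c(3.2, -0.4, 1.7, -1.1, 0.9)

# Replace anything below zero with zero; other values are left untouched
clamped <- pmax(imputed, 0)

n_changed <- sum(imputed < 0)   # how many values the quick fix altered
clamped
n_changed
```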

It is also the case that imputed values generated by a procedure such as NORM have greater dispersion than the variable itself. This is a consequence of the way such imputation works, since each imputation is drawn from a *random* instance of a normal distribution, the mean and the variance of this distribution being sampled from the Bayesian posterior distribution of these parameters given the complete data and the covariates of the incomplete data. So it is more likely that an imputed value will be negative than that an observed value will be negative.
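The extra dispersion can be seen in a simplified sketch. This is not NORM itself: it assumes a single normal variable with no covariates and the standard noninformative prior, and the data are simulated. Each draw first samples a variance and a mean from their posterior and only then samples a value.

```r
set.seed(1)
y <- rnorm(10, mean = 5, sd = 2)   # made-up "observed" values
n <- length(y)
ybar <- mean(y)
s2 <- var(y)

m <- 1e5                           # number of simulated imputations
# Posterior draw of the variance: (n - 1) * s2 / chi-square on n - 1 df
sigma2 <- (n - 1) * s2 / rchisq(m, df = n - 1)
# Posterior draw of the mean, given each sampled variance
mu <- rnorm(m, mean = ybar, sd = sqrt(sigma2 / n))
# One imputed value from each randomly drawn normal distribution
draws <- rnorm(m, mean = mu, sd = sqrt(sigma2))

c(observed_sd = sd(y), imputed_sd = sd(draws))
mean(draws < 0)                    # share of imputations that come out negative
```

With only 10 observations the parameter uncertainty is large, so the spread of the draws is visibly wider than the sample SD; with more data the gap narrows.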

It is also worth looking at the shape of the histogram of such a variable. In many applications (though not all), this may exhibit positive skewness which would suggest that a log-normal distribution would be a better fit in any case. In that case, use the logarithm of the data, which will (to within the adequacy of fit) have a normal distribution. Run your imputations, and then take the exponential of the results, thereby transforming back to the scale of the original variable. This result is necessarily positive, so "anomalous" negative values simply cannot occur.
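A minimal sketch of the log / impute / exponentiate round trip follows. The data are simulated, and impute_normal is a hypothetical stand-in that just fills gaps with draws from a normal fitted to the observed values; it is not what NORM does, and is there only so the example runs in base R. With a real imputation package the same idea would apply: log the column before imputing and exponentiate it afterwards.

```r
set.seed(42)
x <- rlnorm(200, meanlog = 1, sdlog = 0.8)   # made-up positively skewed variable
x[sample(200, 30)] <- NA                     # knock out 30 values at random

# Hypothetical stand-in imputer: fill NAs with draws from N(mean, sd) of the
# observed values (a real procedure would also use covariates and
# parameter uncertainty)
impute_normal <- function(v) {
  miss <- is.na(v)
  v[miss] <- rnorm(sum(miss), mean(v, na.rm = TRUE), sd(v, na.rm = TRUE))
  v
}

raw_scale <- impute_normal(x)                # imputed on the original scale
log_scale <- exp(impute_normal(log(x)))      # imputed on the log scale, then back

sum(raw_scale < 0)    # may be greater than zero
sum(log_scale <= 0)   # cannot be greater than zero
```

Comparing hist(x) with hist(log(x)) before imputing is a quick check on whether the log scale really is the more nearly normal one.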

Also, remember that a variable to which you may have very reasonably attributed a normal distribution (because of good fit to the data) may be intrinsically positive solely for *semantic* reasons. E.g. it may be a measured length. God made all lengths positive, and you and we know this. But R, and NORM, and rnorm(), and all their friends, do not know this. Of semantics they know nothing. And the Daemon of Randomness will see a normal distribution, and mischievously spit negative values at you, simply because they are there ...

However, this is just general advice, though it may give you something to think about.

Meanwhile, I will try to have a look at the dataset whose URL you give, and see if I have any more specific comments.

I've also noted Frank Harrell's comment about aregImpute, and will bear it in mind. Note, however, that this does not do multiple imputation on the same lines as NORM (or the other Schafer-derived MI packages). See ?aregImpute section "Details". And, specifically, from the "Description":

  "The 'transcan' function creates flexible additive imputation
   models but provides only an approximation to true multiple
   imputation as the imputation models are fixed before all
   multiple imputations are drawn. This ignores variability
   caused by having to fit the imputation models. 'aregImpute'
   takes all aspects of uncertainty in the imputations into
   account by using the bootstrap to approximate the process
   of drawing predicted values from a full Bayesian predictive
   distribution."

so that the Rubin/Schafer method described above (see paragraph about dispersion of imputed values) is not fully implemented.

Best wishes,
Ted.