RE: [Rd] RE: [R] using poly in a linear regression in the presence of NA fails (despite subsetting them out)

From: John Fox <jfox_at_mcmaster.ca>
Date: Sun 20 Feb 2005 - 00:36:44 EST


Dear Brian,

Whenever I disagree with you, I wonder what error I made, and almost always discover that there was something I missed.

> -----Original Message-----
> From: r-devel-bounces@stat.math.ethz.ch
> [mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of Prof
> Brian Ripley
> Sent: Saturday, February 19, 2005 1:09 AM
> To: John Fox
> Cc: 'Markus Jantti'; r-devel@stat.math.ethz.ch; 'Liaw,Andy'
> Subject: [Rd] RE: [R] using poly in a linear regression in
> the presence of NA fails (despite subsetting them out)
>
> On Fri, 18 Feb 2005, John Fox wrote:
>
> > Dear Andy, Brian, and Markus,
> >
> > I've moved this to r-devel because the issue is a bit esoteric. I
> > apologize for joining the discussion so late, but didn't have time
> > earlier in the week to formulate these ideas.
>
> I think you didn't take time to check what happens in R-devel, though.

I did check the R 2.1 news file for poly, na.action, and na.omit, but didn't find anything. Did I miss something?

>
> > I believe that I understand and appreciate Brian's point, but think
> > that the issue isn't entirely clear-cut.
> >
> > It seems to me that two kinds of questions arise with missing data.
> > The deeper ones are statistical but there are also nontrivial
> > mechanical issues such as the one here.
> >
> > Whenever a term in a model is parametrized with a data-dependent
> > basis, there's a potential for problems and confusion -- manifested,
> > for example, in the issue of "safe" prediction. I don't think that
> > these problems are unique to missing data. On the other hand, the
> > basis selected for the subspace spanned by a term shouldn't be the
> > most important consideration. That is, when models are equivalent --
> > as, for example, lm(y ~ x + I(x^2)) and lm(y ~ poly(x, 2)) -- an
> > argument could be made for treating them similarly.
>
> Only in linear models, and we are here talking about general
> behaviour of finding the model frame.
>

That's a good point -- I hadn't considered it. I don't see the full ramifications, however.

> > Brian's point that NAs, say, in x2, can influence the basis for
> > poly(x1, 2) is disquieting, but note that this can happen now if
> > there are no NAs in x1.
>
> That is not what I said, and it is incorrect. poly(x1, 2) is
> determined by the set of values of x1 (so they all need to be known,
> hence no NAs in x1) and nothing else. Example:

What you said was, "The problem is that poly(x, 2) depends on the possible set of x values, and so needs to know all of them, unlike e.g. log(x) which is observation-by-observation. Silently omitting missing values is not a good idea in such cases, especially if the values are missing in other variables (which is what na.action is likely to do)." I misinterpreted it.
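A quick check of that point (my illustration, not code from Brian's message): the rows poly() returns for given x values depend on the rest of the data, while log() is evaluated observation by observation.

```r
x <- 1:10
# poly()'s orthonormal basis is computed from the whole vector, so the
# rows for x = 1:5 change when other values are present in the data:
isTRUE(all.equal(poly(x, 2)[1:5, ], poly(1:5, 2)[, ]))  # FALSE
# log(), by contrast, is applied pointwise:
identical(log(x)[1:5], log(1:5))                        # TRUE
```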

>
> > y <- rnorm(10)
> > x1 <- 1:10
> > x2 <- c(NA, runif(9))
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
> y poly(x1, 2).1 poly(x1, 2).2 x2
> 2 -0.5110095 -0.38533732 0.17407766 0.07377988
> 3 -0.9111954 -0.27524094 -0.08703883 0.30968660
> 4 -0.8371717 -0.16514456 -0.26111648 0.71727174
> 5 2.4158352 -0.05504819 -0.34815531 0.50454591
> 6 0.1340882 0.05504819 -0.34815531 0.15299896
> 7 -0.4906859 0.16514456 -0.26111648 0.50393349
> 8 -0.4405479 0.27524094 -0.08703883 0.49396092
> 9 0.4595894 0.38533732 0.17407766 0.75120020
> 10 -0.6937202 0.49543369 0.52223297 0.17464982
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.pass)
> y poly(x1, 2).1 poly(x1, 2).2 x2
> 1 -0.1102855 -0.49543369 0.52223297 NA
> 2 -0.5110095 -0.38533732 0.17407766 0.07377988
> 3 -0.9111954 -0.27524094 -0.08703883 0.30968660
> 4 -0.8371717 -0.16514456 -0.26111648 0.71727174
> 5 2.4158352 -0.05504819 -0.34815531 0.50454591
> 6 0.1340882 0.05504819 -0.34815531 0.15299896
> 7 -0.4906859 0.16514456 -0.26111648 0.50393349
> 8 -0.4405479 0.27524094 -0.08703883 0.49396092
> 9 0.4595894 0.38533732 0.17407766 0.75120020
> 10 -0.6937202 0.49543369 0.52223297 0.17464982
>
>

The point that I was trying to make is this:

> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
> cor(mf[,2])

          1         2
1 1.0000000 0.4037864
2 0.4037864 1.0000000
>
> Data <- na.omit(data.frame(y, x1, x2))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit, data=Data)
> cor(mf[,2])

  1 2
1 1 0
2 0 1
>

> > The point, therefore, doesn't really justify the current behaviour
> > of poly(). Indeed, if there are NAs in x2 but not in x1, the columns
> > representing poly(x1, 2) won't be orthogonal in the subset of cases
> > used in the model fit (though they do provide a correct basis for
> > the term).
>
> It is currently documented to not allow NAs:
>
> x, newdata: a numeric vector at which to evaluate the polynomial.
>             'x' can also be a matrix. Missing values are not allowed
>             in 'x'.
>
> Why does that need any justification?
>

Because one might expect (perhaps, you could argue, one shouldn't expect) the basis for the polynomial term to be orthogonal in the subset of observations actually used for the fit.
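That expectation is easy to check directly (my check, not from the thread): poly()'s columns are orthonormal over the full set of values, but the property is lost as soon as rows are dropped afterwards.

```r
x1 <- 1:10
P <- poly(x1, 2)
# over the full set of x1 values the columns are orthonormal ...
zapsmall(crossprod(P))        # 2 x 2 identity matrix
# ... but not over a subset formed after the basis was computed:
zapsmall(crossprod(P[-1, ]))  # off-diagonal entries are non-zero
```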

> > Consider another example -- lm(y ~ f, subset = f != "a"), where f
> > is a factor with levels "a", "b", and "c", and where there are NAs
> > in f. Here the basis for the f term is data dependent, in that the
> > baseline level is taken as "b" in the absence of "a", yet NAs don't
> > cause an error.
> >
> > Having poly(), bs(), and ns() report a more informative error
> > message certainly is preferable to the current situation, but an
> > alternative would be to have them work and perhaps report a warning.
>
> What exactly does `work' mean here? Run and give misleading answers?
>

That implies that there's a simple right answer here, which I don't think is the case. I'm persuaded that what I recommended is not a good idea because of its more general implications, but it's not obvious to me that creating a basis that's orthogonal in the subset of observations used for the fit is misleading, while the current behaviour isn't.

> > If the object is to protect people from stepping into traps because
> > of missing data, then an argument could be made for having the
> > default na.action be na.fail, as in S-PLUS. (I wouldn't advocate
> > this, by the way.) Perhaps I'm missing some consideration here, but
> > isn't it clearest to allow the data, subset, and na.action arguments
> > to determine the data in which the formula is evaluated?
>
> Exactly how? It would be a major change from the current
> methodology and I am not sure you appreciate what that is.
>

Yes, I looked at how this works currently, but agree that I didn't appreciate the implications of changing it -- e.g., for nonlinear models. I still don't entirely appreciate those implications.

> Remember that na.action could for example replace missing
> values by random imputations, or even multiple imputations,
> and na.action comes quite late in the process *after* the
> model frame has been formed. Unlike factors,
> poly() etc generate multiple columns in the model frame, and
> then subset and na.action are applied. Factors are encoded
> in model.matrix (and that is really only used for the linear
> parts of models, unlike model.frame).
>
> I am not sure you appreciate the basic point: poly is applied
> to the whole dataset and not to a subset, and that can be
> important (e.g. to ensure no unstable extrapolation when
> predicting later).

Note, however, that a user could create the subset as a data frame through na.omit() prior to applying poly() -- as I did in the example above. In fact, that's probably what a user of poly() would now want to do.
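In code, that workflow is simply (a sketch, using the same variable names as the example above):

```r
set.seed(1)  # hypothetical data, as in the earlier example
y  <- rnorm(10)
x1 <- 1:10
x2 <- c(NA, runif(9))
# drop incomplete cases *before* the formula is evaluated, so that
# poly() computes its basis on exactly the cases used in the fit:
Data <- na.omit(data.frame(y, x1, x2))
fit  <- lm(y ~ poly(x1, 2) + x2, data = Data)
```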

> Really na.action has to come last, as
> functions in the formula could themselves generate NAs
> (log(0) for example).
>

I did understand the sequence of operations, but didn't see the point about newly generated NAs. I agree that this makes it necessary to remove NAs last.
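A small illustration of that point (mine, not from Brian's message): the transformation in the formula can itself create missing values that were not in the raw data, so they can only be removed after the formula is evaluated.

```r
x <- c(-1, 0, 1, 4)
log(x)          # NaN (for -1), -Inf (for 0), then finite values
# the NaN produced by log(-1) counts as missing, so na.omit can only
# act on that row once log() has already been applied:
is.na(log(x))   # TRUE FALSE FALSE FALSE
```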

> There is an alternative, that poly() works only on finite
> values and passes through non-finite ones, but it was
> deliberately not written that way and you will need to
> convince everyone that would be a better solution. (What
> happens for example if there are only two finite values?) If
> you want such a function, it is easy for you to provide it:
> just please don't call it poly(). [Writing poly() was not
> easy BTW, but poly_allow_NAs would be, given poly.]
>

I'm no longer convinced that what I proposed is a better solution, but the current situation seems problematic to me as well.
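For concreteness, the alternative Brian describes might look like the sketch below. The name poly_allow_NAs is his; the implementation is my guess at what such a wrapper would do, not code from the thread, and it inherits the objection he raises: with too few distinct finite values the basis is undefined.

```r
# hypothetical sketch: compute the basis on the finite values only and
# pass the non-finite ones through as NA rows
poly_allow_NAs <- function(x, degree = 1) {
  ok <- is.finite(x)
  # poly() itself errors here when there are too few distinct finite
  # values -- Brian's "only two finite values" objection
  basis <- poly(x[ok], degree)
  out <- matrix(NA_real_, nrow = length(x), ncol = degree)
  out[ok, ] <- basis
  out
}

p <- poly_allow_NAs(c(NA, 1:9), 2)
p[1, ]   # NA NA: the missing value is passed through
```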

Thanks for the extended explanation.

John

> --
> Brian D. Ripley,                  ripley@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
> ______________________________________________
> R-devel@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Received on Sat Feb 19 23:43:31 2005

This archive was generated by hypermail 2.1.8 : Sun 20 Feb 2005 - 01:30:26 EST