From: John Fox <jfox_at_mcmaster.ca>

Date: Sun 20 Feb 2005 - 00:36:44 EST


Dear Brian,

Whenever I disagree with you, I wonder what error I made, and almost always discover that there was something I missed.

> -----Original Message-----
> From: r-devel-bounces@stat.math.ethz.ch
> [mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of Prof Brian Ripley
> Sent: Saturday, February 19, 2005 1:09 AM
> To: John Fox
> Cc: 'Markus Jantti'; r-devel@stat.math.ethz.ch; 'Liaw, Andy'
> Subject: [Rd] RE: [R] using poly in a linear regression in
> the presence of NAs fails (despite subsetting them out)

> On Fri, 18 Feb 2005, John Fox wrote:

> > Dear Andy, Brian, and Markus,
> >
> > I've moved this to r-devel because the issue is a bit esoteric. I
> > apologize for joining the discussion so late, but didn't have time
> > earlier in the week to formulate these ideas.
>
> I think you didn't take time to check what happens in R-devel, though.

I did check the R 2.1 news file for poly, na.action, and na.omit, but didn't find anything. Did I miss something?


> > I believe that I understand and appreciate Brian's point, but think
> > that the issue isn't entirely clear-cut.
> >
> > It seems to me that two kinds of questions arise with missing data.
> > The deeper ones are statistical but there are also nontrivial
> > mechanical issues such as the one here.
> >
> > Whenever a term in a model is parametrized with a data-dependent
> > basis, there's a potential for problems and confusion -- manifested,
> > for example, in the issue of "safe" prediction. I don't think that
> > these problems are unique to missing data. On the other hand, the
> > basis selected for the subspace spanned by a term shouldn't be the
> > most important consideration. That is, when models are equivalent --
> > as, for example, lm(y ~ x + I(x^2)) and lm(y ~ poly(x, 2)) -- an
> > argument could be made for treating them similarly.
>
> Only in linear models, and we are here talking about general
> behaviour of finding the model frame.

That's a good point -- I hadn't considered it. I don't see the full ramifications, however.

> > Brian's point that NAs, say, in x2, can influence the basis for
> > poly(x1, 2) is disquieting, but note that this can happen now if
> > there are no NAs in x1.
>
> That is not what I said, and it is incorrect. poly(x1, 2) is
> determined by the set of values of x1 (so they need all to be known,
> hence no NAs in x1) and nothing else. Example:

What you said was, "The problem is that poly(x, 2) depends on the possible set of x values, and so needs to know all of them, unlike e.g. log(x) which is observation-by-observation. Silently omitting missing values is not a good idea in such cases, especially if the values are missing in other variables (which is what na.action is likely to do)." I misinterpreted it.
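To make the quoted point concrete, here is a minimal sketch (my own invented illustration, not from the thread): log() operates observation-by-observation, so dropping a value leaves the remaining results unchanged, whereas the orthogonal basis that poly() constructs is computed from the whole set of x values at once, so every row changes when one value is removed.

```r
## log() is observation-by-observation; poly()'s basis is not.
x <- 1:10
b_full <- poly(x, 2)        # orthogonal basis built from all ten values
b_drop <- poly(x[-1], 2)    # basis rebuilt after one value is removed

## log() agrees on the nine shared observations ...
same_log <- identical(log(x)[-1], log(x[-1]))               # TRUE
## ... but the first poly() column differs in every row.
same_poly <- isTRUE(all.equal(b_full[-1, 1], b_drop[, 1]))  # FALSE
```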


> > y <- rnorm(10)
> > x1 <- 1:10
> > x2 <- c(NA, runif(9))
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
>             y poly(x1, 2).1 poly(x1, 2).2         x2
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982
> > model.frame(y ~ poly(x1, 2) + x2, na.action = na.pass)
>             y poly(x1, 2).1 poly(x1, 2).2         x2
> 1  -0.1102855   -0.49543369    0.52223297         NA
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982

The point that I was trying to make is this:

> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
> cor(mf[,2])
          1         2
1 1.0000000 0.4037864
2 0.4037864 1.0000000
> Data <- na.omit(data.frame(y, x1, x2))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit, data = Data)
> cor(mf[,2])
  1 2
1 1 0
2 0 1

> > The point, therefore, doesn't really justify the current behaviour
> > of poly(). Indeed, if there are NAs in x2 but not in x1, the columns
> > representing poly(x1, 2) won't be orthogonal in the subset of cases
> > used in the model fit (though they do provide a correct basis for
> > the term).
>
> It is currently documented to not allow NAs:
>
>   x, newdata: a numeric vector at which to evaluate the polynomial. 'x'
>           can also be a matrix. Missing values are not allowed in 'x'.
>
> Why does that need any justification?

Because one might expect (perhaps, you could argue, one shouldn't expect) the basis for the polynomial term to be orthogonal in the subset of observations actually used for the fit.

> > Consider another example -- lm(y ~ f, subset = f != "a"), where f is
> > a factor with levels "a", "b", and "c", and where there are NAs in f.
> > Here the basis for the f term is data dependent, in that the baseline
> > level is taken as "b" in the absence of "a", yet NAs don't cause an
> > error.
> >
> > Having poly(), bs(), and ns() report a more informative error message
> > certainly is preferable to the current situation, but an alternative
> > would be to have them work and perhaps report a warning.
>
> What exactly does `work' mean here? Run and give misleading answers?

That implies that there's a simple right answer here, which I don't think is the case. I'm persuaded that what I recommended is not a good idea because of its more general implications, but it's not obvious to me that creating a basis that's orthogonal in the subset of observations used for the fit is misleading, while the current behaviour isn't misleading.
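A concrete version of the factor example quoted above (the data here are invented for illustration): with "a" subsetted out, lm() drops the unused level, the baseline becomes "b", and the NA in f is silently removed rather than causing an error.

```r
## Factor bases are data dependent too: subsetting out level "a"
## makes "b" the baseline, and the NA in f is dropped without error.
set.seed(1)
f <- factor(c("a", "b", "c", NA, "b", "c", "a", "b"))
y <- rnorm(length(f))
fit <- lm(y ~ f, subset = f != "a")
colnames(model.matrix(fit))   # "(Intercept)" "fc": baseline is now "b"
```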

> > If the object is to protect people from stepping into traps because
> > of missing data, then an argument could be made for having the
> > default na.action be na.fail, as in S-PLUS. (I wouldn't advocate
> > this, by the way.) Perhaps I'm missing some consideration here, but
> > isn't it clearest to allow the data, subset, and na.action arguments
> > to determine the data in which the formula is evaluated?
>
> Exactly how? It would be a major change from the current methodology
> and I am not sure you appreciate what that is.

Yes, I looked at how this works currently, but agree that I didn't appreciate the implications of changing it -- e.g., for nonlinear models. I still don't entirely appreciate those implications.

> Remember that na.action could for example replace missing values by
> random imputations, or even multiple imputations, and na.action comes
> quite late in the process *after* the model frame has been formed.
> Unlike factors, poly() etc generate multiple columns in the model
> frame, and then subset and na.action are applied. Factors are encoded
> in model.matrix (and that is really only used for the linear parts of
> models, unlike model.frame).
>
> I am not sure you appreciate the basic point: poly is applied to the
> whole dataset and not to a subset, and that can be important (e.g. to
> ensure no unstable extrapolation when predicting later).

Note, however, that a user could create the subset as a data frame through na.omit() prior to applying poly() -- as I did in the example above. In fact, that's probably what a user of poly() would now want to do.
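That workaround looks like this in full (a minimal sketch with invented data): build the complete-case data frame first, so that poly() only ever sees the rows actually used in the fit, and its columns come out orthogonal over that subset.

```r
## Remove incomplete cases *before* the formula is evaluated.
set.seed(2)
y  <- rnorm(10)
x1 <- 1:10
x2 <- c(NA, runif(9))
Data <- na.omit(data.frame(y, x1, x2))   # drop the incomplete case up front
fit  <- lm(y ~ poly(x1, 2) + x2, data = Data)
## the polynomial columns are orthogonal over the fitted subset
cor(model.matrix(fit)[, 2:3])            # identity, up to rounding
```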

> Really na.action has to come last, as functions in the formula could
> themselves generate NAs (log(0) for example).

I did understand the sequence of operations, but didn't see the point about newly generated NAs. I agree that this makes it necessary to remove NAs last.
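A small sketch of that point (invented data): a function inside the formula can produce non-finite values that exist only once the model frame has been built, so na.action has to run after the frame is formed in order to catch them.

```r
## log() of a negative value yields NaN only inside the model frame,
## so na.action must be applied after the frame is built.
x <- c(0, 1, 2, -1)
y <- rnorm(length(x))
mf <- suppressWarnings(model.frame(y ~ log(x), na.action = na.omit))
nrow(mf)   # 3: the NaN row (from log(-1)) is removed; the -Inf row survives
```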

> There is an alternative, that poly() works only on finite values and
> passes through non-finite ones, but it was deliberately not written
> that way and you will need to convince everyone that would be a
> better solution. (What happens for example if there are only two
> finite values?) If you want such a function, it is easy for you to
> provide it: just please don't call it poly(). [Writing poly() was
> not easy BTW, but poly_allow_NAs would be, given poly.]

I'm no longer convinced that what I proposed is a better solution, but the current situation seems problematic to me as well.

Thanks for the extended explanation.

John

> --
> Brian D. Ripley,                  ripley@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>
> ______________________________________________
> R-devel@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Sat Feb 19 23:43:31 2005

This archive was generated by hypermail 2.1.8 : Sun 20 Feb 2005 - 01:30:26 EST