Re: [Rd] problem with zero-weighted observations in predict.lm?

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 29 Jul 2010 07:34:26 +0100 (BST)

On Thu, 29 Jul 2010, Peter Dalgaard wrote:

> Peter Dalgaard wrote:

>> William Dunlap wrote:
>>> In modelling functions some people like to use
>>> a weight of 0 to drop an observation instead of
>>> using a subset value of FALSE. E.g.,
>>> weights=c(0,1,1,...)
>>> instead of
>>> subset=c(FALSE, TRUE, TRUE, ...)
>>> to drop the first observation.
>>>
>>> lm() and summary.lm() appear to treat these in the
>>> same way, decrementing the number of degrees of
>>> freedom for each dropped observation. However,
>>> predict.lm() does not treat them the same. It
>>> doesn't seem to diminish the df to account for the
>>> 0-weighted observations. E.g., the last printout
>>> from the following script is as follows, where
>>> predw is the prediction from the fit that used
>>> 0-weights and preds is from using FALSE's in the
>>> subset argument. Is this difference proper?
>>
>> Nice catch.
>>
>> The issue is that the subset fit and the zero-weighted fit are not
>> completely the same. Notice that the residuals vector has different
>> length in the two analyses. With a simplified setup:
>>
>>> length(lm(y~1,weights=w)$residuals)
>> [1] 10
>>> length(lm(y~1,subset=-1)$residuals)
>> [1] 9
>>> w
>> [1] 0 1 1 1 1 1 1 1 1 1
>>
>> This in turn is what confuses predict.lm because it gets n and residual
>> df from length(object$residuals). summary.lm() uses NROW(Qr$qr), and I
>> suppose that predict.lm should follow suit.
>>
>
> ...and then when I went to fix it, I found that the actual line in the
> sources (stats/R/lm.R) reads
>
> 27442     ripley     n <- length(object$residuals) # NROW(object$qr$qr)
>
> so it's been like that since December 2003. I wonder if Brian remembers
> what the point was? (27442 was the restructuring into the stats package,
> so it might not actually be Brian's code).

At least that wasn't the point of change: the code was the same in R 1.8.1 (pre-split).

I think you will find that 'n' is used in several ways in predict.lm, and since NA-handling was introduced in R 1.8.0 they may differ in value. So the safest route seems to be to change just 'n' in

                 df <- n - p

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 29 Jul 2010 - 06:37:39 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 29 Jul 2010 - 08:20:20 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive