[Rd] problem with zero-weighted observations in predict.lm?

From: William Dunlap <wdunlap_at_tibco.com>
Date: Tue, 27 Jul 2010 11:48:56 -0700


In modelling functions some people like to use a weight of 0 to drop an observation instead of using a subset value of FALSE. E.g.,
  weights=c(0,1,1,...)
instead of
  subset=c(FALSE, TRUE, TRUE, ...)
to drop the first observation.

lm() and summary.lm() appear to treat these in the same way, decrementing the number of degrees of freedom for each dropped observation. However, predict.lm() does not treat them the same. It doesn't seem to diminish the df to account for the 0-weighted observations. E.g., the last printout from the following script is as follows, where predw is the prediction from the fit that used 0-weights and preds is from using FALSE's in the subset argument. Is this difference proper?

 predw                        preds                       

$fit $fit
fit lwr upr fit lwr upr
 1 1.544302 1.389254 1.699350 1 1.544302 1.194879 1.893724  2 1.935504 1.719482 2.151526 2 1.935504 1.448667 2.422341
                                                          

$se.fit $se.fit
1 2 1 2 0.06723657 0.09367810 0.1097969 0.1529757
$df $df
[1] 8 [1] 3
$residual.scale $residual.scale
[1] 0.1031362 [1] 0.1684207

### start of script ###

data <- data.frame(x=1:10, y=log(1:10))
fitw <- lm(data=data, y~x, weights=as.numeric(x<=5))
fits <- lm(data=data, y~x, subset=x<=5)

fitw$df.residual == 3 && fits$df.residual == 3 # TRUE identical(coef(fitw), coef(fits)) # TRUE

sumw <- summary(fitw)
sums <- summary(fits)
identical(sumw$df, sums$df) # TRUE

predw <- predict(fitw, interval="confidence",

         se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) preds <- predict(fits, interval="confidence",

         se.fit=TRUE, newdata=data.frame(x=c(4.5,5.5))) all.equal(predw, preds) # lots of differences

sideBySide <- function (a, b, argNames)
{

    # print objects a and b side by side     oldWidth <- options(width = getOption("width")/2 - 4)     on.exit(options(oldWidth))
    if (missing(argNames)) {

        argNames <- c(deparse(substitute(a))[1],
                      deparse(substitute(b))[1])
    }
    pa <- capture.output(print(a))
    pb <- capture.output(print(b))
    nlines <- max(length(pa), length(pb))
    length(pa) <- nlines
    length(pb) <- nlines
    pb[is.na(pb)] <- ""
    pa[is.na(pa)] <- ""
    retval <- cbind(pa, pb, deparse.level = 0)     dimnames(retval) <- list(rep("",nrow(retval)), argNames)     noquote(retval)
}

# lwr, upr, se.fit, df, residual.scale differ sideBySide(predw, preds)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue 27 Jul 2010 - 18:52:33 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 28 Jul 2010 - 09:20:20 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive