Re: [R] NaN when using dffits, stemming from lm.influence call

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu 31 Aug 2006 - 22:51:40 EST

When applying dffits to a GLM, you are presumably intending to apply it to the working regression. However, that is not what lm.influence has long been set up to do (it uses the deviance residuals).

In your example the influence is high and so the approximations of applying the standard formulae to the working regression are unrealistic, hence the NaN (it is predicting a negative variance on deleting that case, which is a failure of the approximation used).

If you look at the reference (Williams) you will find discussion of the approxmations and indeed what is possibly a better approximation in his 'likelihood residuals'.

On Thu, 31 Aug 2006, Peter Dunn wrote:

> Hi all
>
> I'm getting a NaN returned on using dffits, as explained
> below. To me, there seems no obvious (or non-obvious reason
> for that matter) reason why a NaN appears.
>
> Before I start digging further, can anyone see why dffits
> might be failing? Is there a problem with the data?
>
>
> Consider:
>
> # Load data
> dep <-
> read.table("http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat",
> header=TRUE)
> attach(dep)
> dep
>
> # Fit Poisson glm
> dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children)
> + factor(Depression):factor(SLE),
> family=poisson( link=log) )
>
> # Compute dffits
> dffits( dep.glm2 )
>
>
> This produces the output:
> 1 2 3 4 5 6 1.4207746
> -0.1513808 NaN 0.9079142 -0.1032664 -1.0860289
> 7 8
> 0.4853797 3.8560863
>
> NaN exists for Observation 3. I cannot understand why: there's
> nothing grossly unusual or bad about it. I look a bit closer,
> and it falls over in lm.influence when computing the deletion
> statistic sigma:
>
>
>
> > lm.influence(dep.glm2)$sigma
> 1 2 3 4 5 6 7 8
> 0.914829 2.134279 NaN 2.186707 2.224885 1.934539 2.225115 1.957111
>
> The rest of the results from lm.influence are OK; for example:
>
> > lm.influence(dep.glm2)$wt.res
> 1 2 3 4 5 6
> 2.62840627 -0.88476903 -1.09492912 0.20247856 -0.23114458 -0.95123387
> 7 8
> 0.07521515 0.30208051
>
>
> Use of debug( lm.influence ) indicates the NaN appears in this line
> of lm.influence:
>
>
>
> res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef),
> model$qr$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef)
> matrix(0,
> n, k) else double(0), sigma = double(n), tol = 10 *
> .Machine$double.eps,
> DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma",
> "wt.res")]
>
>
> I don't particularly wish to dig around in the Fortran if someone
> else can look at it and see my problem easily. But if I must...
>
>
>
> The appearance of the NaN seems odd, since (as I understand it)
> lm.influence(dep.glm2)$sigma computes sigma when each observation
> is removed in turn. So if I remove Observation 3 and try fitting the
> model, there are no problems or complaints:
>
> dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) +
> factor(Children) + factor(Depression):factor(SLE),
> family=poisson( link=log), subset=(-3) )
>
>
> This produces:
>
>
> > dep.glm3
>
> Call: glm(formula = Counts ~ factor(Depression) + factor(SLE) +
> factor(Children) + factor(Depression):factor(SLE), family = poisson(link
> = log), subset = (-3))
>
> Coefficients:
> (Intercept) factor(Depression)1
> 5.4389 -4.1392
> factor(SLE)1 factor(Children)1
> -0.6503 -2.4036
> factor(Depression)1:factor(SLE)1
> 3.9513
>
> Degrees of Freedom: 6 Total (i.e. Null); 2 Residual
> Null Deviance: 695.9
> Residual Deviance: 0.8535 AIC: 41.25
>
>
> No problems, errors, or any signs of potential problems.
>
>
> In the changes to R 2.3.0 (in the NEWS file,
> eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS)
> I find this:
>
> o Influence measures such as rstandard() and cooks.distance()
> could return infinite values rather than NaN for a case which
> was fitted exactly. Similarly, plot.lm() could fail on such
> examples. plot.lm(which = 5) had to be modified to only plot
> cases with hat < 1. (PR#8367)
>
> lm.influence() was incorrectly reporting 'coefficients' and
> 'sigma' as NaN for cases with hat = 1, and on some platforms
> not detecting hat = 1 correctly.
>
> The last sentence identifies NaN being reported for sigma, as I
> find with my data. But my data do not have hat = 1, but the hat
> diagonals are large. The troublesome Observation 3 does not have the
> largest hat value in the data though:
>
> > hatvalues(dep.glm2)
> 1 2 3 4 5 6 7
> 8
> 0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408
> 0.9607654
>
> And besides, I am using the most recent version of R (see below). BTW,
> the NaNs appear in the previous version of R also.
>
> So I'm flummoxed.
>
> As always, help appreciated.
>
> P.
>
>
>
> > version
> _
> platform i386-pc-linux-gnu
> arch i386
> os linux-gnu
> system i386, linux-gnu
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
>
>
>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu Aug 31 22:56:49 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 01 Sep 2006 - 00:24:05 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.