[R] NaN when using dffits, stemming from lm.influence call

From: Peter Dunn <dunn_at_usq.edu.au>
Date: Thu 31 Aug 2006 - 12:18:52 EST


Hi all

I'm getting a NaN returned on using dffits, as explained below. To me, there seems no obvious (or non-obvious reason for that matter) reason why a NaN appears.

Before I start digging further, can anyone see why dffits might be failing? Is there a problem with the data?

Consider:

# Load data
dep <-
read.table("
http://www.sci.usq.edu.au/staff/dunn/Datasets/Books/Hand/Hand-R/factor1-R.dat",

    header=TRUE)
attach(dep)
dep

# Fit Poisson glm
dep.glm2 <- glm( Counts ~ factor(Depression) + factor(SLE) + factor(Children) + factor(Depression):factor(SLE),

    family=poisson( link=log) )

# Compute dffits
dffits( dep.glm2 )

This produces the output:

          1          2          3          4          5          6 
      1.4207746 -0.1513808        NaN  0.9079142 -0.1032664 -1.0860289
          7          8

0.4853797 3.8560863

NaN exists for Observation 3. I cannot understand why: there's nothing grossly unusual or bad about it. I look a bit closer, and it falls over in lm.influence when computing the deletion statistic sigma:

> lm.influence(dep.glm2)$sigma

        1        2        3        4        5        6        7        8
0.914829 2.134279      NaN 2.186707 2.224885 1.934539 2.225115 1.957111

The rest of the results from lm.influence are OK; for example:

> lm.influence(dep.glm2)$wt.res

           1 2 3 4 5 6   2.62840627 -0.88476903 -1.09492912 0.20247856 -0.23114458 -0.95123387

           7 8
  0.07521515 0.30208051

Use of debug( lm.influence ) indicates the NaN appears in this line of lm.influence:

res <- .Fortran("lminfl", model$qr$qr, n, n, k, as.integer(do.coef),

     model$qr$qraux, wt.res = e, hat = double(n), coefficients = if (do.coef) matrix(0,

         n, k) else double(0), sigma = double(n), tol = 10 * .Machine$double.eps,

     DUP = FALSE, PACKAGE = "base")[c("hat", "coefficients", "sigma",
     "wt.res")]


I don't particularly wish to dig around in the Fortran if someone else can look at it and see my problem easily. But if I must...

The appearance of the NaN seems odd, since (as I understand it)

   lm.influence(dep.glm2)$sigma computes sigma when each observation is removed in turn. So if I remove Observation 3 and try fitting the model, there are no problems or complaints:

dep.glm3 <- glm( Counts ~ factor(Depression) + factor(SLE) +

    factor(Children) + factor(Depression):factor(SLE),     family=poisson( link=log), subset=(-3) )

This produces:

> dep.glm3

Call: glm(formula = Counts ~ factor(Depression) + factor(SLE) +

factor(Children) +      factor(Depression):factor(SLE), family = 
poisson(link = log),      subset = (-3))

Coefficients:
                      (Intercept)               factor(Depression)1
                           5.4389                           -4.1392
                     factor(SLE)1                 factor(Children)1
                          -0.6503                           -2.4036
factor(Depression)1:factor(SLE)1
                           3.9513

Degrees of Freedom: 6 Total (i.e. Null);  2 Residual
Null Deviance:      695.9
Residual Deviance: 0.8535       AIC: 41.25


No problems, errors, or any signs of potential problems.

In the changes to R 2.3.0 (in the NEWS file, eg http://mirror.aarnet.edu.au/pub/CRAN/src/base/NEWS) I find this:

    o	Influence measures such as rstandard() and cooks.distance()
	could return infinite values rather than NaN for a case which
	was fitted exactly.  Similarly, plot.lm() could fail on such
	examples.  plot.lm(which = 5)  had to be modified to only plot
	cases with hat < 1.  (PR#8367)

	lm.influence() was incorrectly reporting 'coefficients' and
	'sigma' as NaN for cases with hat = 1, and on some platforms
	not detecting hat = 1 correctly.

The last sentence identifies NaN being reported for sigma, as I find with my data. But my data do not have hat = 1, but the hat diagonals are large. The troublesome Observation 3 does not have the largest hat value in the data though:

> hatvalues(dep.glm2)

         1         2         3         4         5         6         7 
        8

0.1689061 0.1064651 0.9098542 0.9030814 0.3799079 0.6382790 0.9327408 0.9607654

And besides, I am using the most recent version of R (see below). BTW, the NaNs appear in the previous version of R also.

So I'm flummoxed.

As always, help appreciated.

P.

> version

                _
platform       i386-pc-linux-gnu
arch           i386
os             linux-gnu
system         i386, linux-gnu
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R

version.string Version 2.3.1 (2006-06-01)
-- 
Dr Peter Dunn  |  Email:  dunn <at> usq.edu.au
Faculty of Sciences, University of Southern Queensland
and the Australian Centre for Sustainable Catchments
CRICOS:  QLD 00244B |  NSW 02225M |  VIC 02387D |  WA 02521C

______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Thu Aug 31 13:24:29 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 01 Sep 2006 - 00:23:53 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.