R-alpha: glm and substitute bugs

Ross Ihaka (ihaka@stat.auckland.ac.nz)
Tue, 14 Jan 1997 15:30:49 +1300 (NZDT)

From: Ross Ihaka <ihaka@stat.auckland.ac.nz>
Date: Tue, 14 Jan 1997 15:30:49 +1300 (NZDT)
Message-Id: <199701140230.PAA27335@stat13.stat.auckland.ac.nz>
To: Jim Lindsey <jlindsey@luc.ac.be>
Subject: R-alpha: glm and substitute bugs
In-Reply-To: <9701130853.AA01242@alpha.luc.ac.be>

Jim Lindsey writes:
 > 1. Kurt Hornik has asked me to look at a  data set where glm with
 > binomial gives all zero coefficients and S-Plus does not. It turns out
 > that R gives all zero coefficients (except the intercept) for all
 > binomial data sets! I have traced the problem to initialize in the
 > binomial family which is not properly evaluated. Consider the
 > following:
 > > substitute(y[1,2])
 > y[1,2] 
 > > substitute(y[,2])
 > y[2] 
 > I do not know how long this problem has existed because I have never
 > before used binomial, only Poisson. (GLIM4 is about 100 times faster,
 > and  even direct minimization with nlm is considerably faster than
 > glm.)

Ouch!  The problem with substitute has been there for a long time.
There is a simple fix, but like all these things, the question is what
will break if we make the change.

For any brave souls who want to experiment, the problem is in
coerce.c in the function substituteList.  The lines

	else if(CAR(el) == R_MissingArg) {
		return substituteList(CDR(el), rho);

drop any missing arguments from the expression being substituted.
I think (no promises) that these can safely be deleted.

We'd be happy to insert an new glm module if any kind soul would like
to provide one :).  At least we're faster than Splus.

 > 2. While tracing that problem down, I discovered that Ross's changes
 > to glm have introduced a new error. He has incorrectly changed the
 > initial values of eta so that IWLS no longer always converges (and
 > usually more slowly). Here is an example that does not converge with
 > the new initial estimates, giving completely wild values, but does
 > converge correctly with the old:
 > y <- c(3,5,4,3,11,20,31,40,43,48,48,50)
 > glm(y~1,family=poisson)
 > The initialization of eta in glm.fit should be put back the way it was
 > before:
 > start <- qr.coef(fit, w*z)

Which gets back to the original problem of how to handle singular
design matrices.  The problem is that when the design is singular,
the coefficients come back with NAs in positions corresponding to
dependent columns.  This makes all the starting values be NA which is
a disaster.  Another possible fix is to add
	start[is.na(start)] <- 0
immediately after
	start <- qr.coef(fit, w*z)

BUT there are some other problems when the design matrix is singular.
It may be time to bite the bullet and really redo glm.

r-testers mailing list -- For info or help, send "info" or "help",
To [un]subscribe, send "[un]subscribe"
(in the "body", not the subject !)  To: r-testers-request@stat.math.ethz.ch