Re: [R] lean and mean lm/glm?

From: Frank E Harrell Jr
Date: Thu 24 Aug 2006 - 01:54:04 EST

Thomas Lumley wrote:
> On Wed, 23 Aug 2006, Damien Moore wrote:

>> Thomas Lumley wrote:
>>> No, it is quite straightforward if you are willing to make multiple passes
>>> through the data. It is hard with a single pass and may not be possible
>>> unless the data are in random order.
>>> Fisher scoring for glms is just an iterative weighted least squares
>>> calculation using a set of 'working' weights and 'working' response. These
>>> can be defined chunk by chunk and fed to biglm. Three iterations should
>>> be sufficient.
>> (NB: Although not stated clearly I was referring to single pass when I 
>> wrote "impossible"). Doing as you suggest with multiple passes would 
>> entail either sticking the database input calls into the main iterative 
>> loop of a lookalike or lumping the user with a very unattractive 
>> sequence of calls:

> I have written most of a bigglm() function where the data= argument is a
> function with a single argument 'reset'. When called with reset=FALSE the
> function should return another chunk of data, or NULL if no data are
> available, and when called with reset=TRUE it should go back to the
> beginning of the data. I don't think this is too inelegant.
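
The data= convention quoted above, combined with the multi-pass Fisher scoring described earlier in the thread, might look roughly like the following. This is a Python sketch for illustration only, not the bigglm() code: the names make_chunk_source, solve and chunked_logistic_irls are hypothetical, the in-memory list stands in for a database, and the model is assumed to be logistic regression with 0/1 responses.

```python
import math

def make_chunk_source(rows, chunk_size):
    """Hypothetical data(reset) function; `rows` stands in for a database."""
    state = {"pos": 0}

    def data(reset=False):
        if reset:                      # go back to the beginning of the data
            state["pos"] = 0
            return None
        if state["pos"] >= len(rows):  # no data left
            return None
        chunk = rows[state["pos"]:state["pos"] + chunk_size]
        state["pos"] += chunk_size
        return chunk

    return data

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def chunked_logistic_irls(data, p, passes=3):
    """One full pass over the data per Fisher scoring iteration."""
    beta = [0.0] * p
    for _ in range(passes):
        xtwx = [[0.0] * p for _ in range(p)]   # accumulates X'WX
        xtwz = [0.0] * p                       # accumulates X'Wz
        data(reset=True)
        while True:
            chunk = data(reset=False)
            if chunk is None:
                break
            for x, y in chunk:
                eta = sum(b * xi for b, xi in zip(beta, x))
                mu = 1.0 / (1.0 + math.exp(-eta))
                w = mu * (1.0 - mu)            # working weight
                z = eta + (y - mu) / w         # working response
                for j in range(p):
                    xtwz[j] += w * x[j] * z
                    for k in range(p):
                        xtwx[j][k] += w * x[j] * x[k]
        beta = solve(xtwx, xtwz)               # weighted least squares step
    return beta
```

Only the p x p and p x 1 accumulators are kept between chunks, so memory use does not grow with the number of rows. The sketch makes no attempt to guard against working weights that underflow to zero.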

> In general I don't think a one-pass algorithm is possible. If the data are
> in random order then you could read one chunk, fit a glm, and set up a
> grid of coefficient values around the estimate. You then read the rest of
> the data, computing the loglikelihood and score function at each point in
> the grid. After reading all the data you can then fit a suitable smooth
> surface to the loglikelihood. I don't know whether this will give
> sufficient accuracy, though.
> For really big data sets you are probably better off with the approach
> that Brian Ripley and Fei Chen used -- they have shown that it works and
> there is unlikely to be anything much simpler that also works that they
> missed.
> -thomas
> Thomas Lumley Assoc. Professor, Biostatistics
> University of Washington, Seattle

What I would like to see someone work on is a kind of SQL code generator that, given a set of weights, passes through the database and computes a new weighted information matrix. The code generator would treat the design matrix as a symbolic entity. SQL, or another suitable framework, would return the p x p matrix for one iteration at a time.
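
As a rough illustration of the idea, here is a minimal Python sketch using the standard sqlite3 module. It makes several assumptions: the function names information_matrix_sql and weighted_information are hypothetical, the weights are taken to be already stored in a column of the table, and the "symbolic" design matrix is reduced to a plain list of numeric column names. A real generator would also have to expand factors and interactions.

```python
import sqlite3

def information_matrix_sql(table, columns, weight):
    """Generate one SELECT that returns the upper triangle of X'WX."""
    terms = []
    for j, cj in enumerate(columns):
        for ck in columns[j:]:
            terms.append(f'SUM("{weight}" * "{cj}" * "{ck}")')
    return f'SELECT {", ".join(terms)} FROM "{table}"'

def weighted_information(conn, table, columns, weight):
    """Run the generated query and unpack the result into a p x p matrix."""
    row = conn.execute(information_matrix_sql(table, columns, weight)).fetchone()
    p = len(columns)
    mat = [[0.0] * p for _ in range(p)]
    values = iter(row)
    for j in range(p):
        for k in range(j, p):
            mat[j][k] = mat[k][j] = next(values)   # fill both triangles
    return mat
```

The database does a single pass per call and returns only p(p+1)/2 numbers, so the client never holds the raw data. Updating the weight column between iterations is left out of the sketch.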


Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Received on Thu Aug 24 02:06:01 2006
