Re: [R] lean and mean lm/glm?

From: Thomas Lumley
Date: Thu 24 Aug 2006 - 01:25:54 EST

On Wed, 23 Aug 2006, Damien Moore wrote:

> Thomas Lumley wrote:
>> No, it is quite straightforward if you are willing to make multiple passes
>> through the data. It is hard with a single pass and may not be possible
>> unless the data are in random order.
>> Fisher scoring for glms is just an iterative weighted least squares
>> calculation using a set of 'working' weights and 'working' response. These
>> can be defined chunk by chunk and fed to biglm. Three iterations should
>> be sufficient.
> (NB: Although not stated clearly I was referring to single pass when I
> wrote "impossible"). Doing as you suggest with multiple passes would
> entail either sticking the database input calls into the main iterative
> loop of a lookalike or lumping the user with a very unattractive
> sequence of calls:

I have written most of a bigglm() function where the data= argument is a function with a single argument 'reset'. When called with reset=FALSE the function should return another chunk of data, or NULL if no data are available, and when called with reset=TRUE it should go back to the beginning of the data. I don't think this is too inelegant.
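A minimal sketch of that calling convention, written in Python rather than R and not taken from bigglm() itself. The names make_chunk_source, solve and fit_logistic are hypothetical, the in-memory list stands in for a database query, and the model is assumed to be a logistic regression. Each pass resets the data function, accumulates the weighted cross-products chunk by chunk from the working weights and working response, and then solves for the new coefficients.

```python
import math

def make_chunk_source(rows, chunk_size):
    """Build a data(reset) function over an in-memory list of (x, y) rows.
    The list is a stand-in for a database cursor (an assumption)."""
    pos = 0
    def data(reset=False):
        nonlocal pos
        if reset:                      # go back to the beginning of the data
            pos = 0
            return None
        if pos >= len(rows):           # no data left
            return None
        chunk = rows[pos:pos + chunk_size]
        pos += chunk_size
        return chunk
    return data

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logistic(data, p, passes=5):
    """Fisher scoring for a logistic model, one pass over the data per iteration."""
    beta = [0.0] * p
    for _ in range(passes):
        xtwx = [[0.0] * p for _ in range(p)]
        xtwz = [0.0] * p
        data(reset=True)
        while True:
            chunk = data(reset=False)
            if chunk is None:
                break
            for x, y in chunk:
                eta = sum(b * xi for b, xi in zip(beta, x))
                mu = 1.0 / (1.0 + math.exp(-eta))
                w = mu * (1.0 - mu)        # working weight
                wz = w * eta + (y - mu)    # weight times working response
                for i in range(p):
                    xtwz[i] += x[i] * wz
                    for j in range(p):
                        xtwx[i][j] += w * x[i] * x[j]
        beta = solve(xtwx, xtwz)
    return beta

# Two groups of four observations: 1 success in group 0, 3 successes in group 1.
rows = ([([1.0, 0.0], 1.0)] + [([1.0, 0.0], 0.0)] * 3
        + [([1.0, 1.0], 1.0)] * 3 + [([1.0, 1.0], 0.0)])
source = make_chunk_source(rows, chunk_size=3)
print(fit_logistic(source, p=2))
```

Only the p-by-p cross-product matrix and a length-p vector are held between chunks, so memory use does not grow with the number of rows. The number of passes is a parameter here because how many are needed depends on the data.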

In general I don't think a one-pass algorithm is possible. If the data are in random order then you could read one chunk, fit a glm, and set up a grid of coefficient values around the estimate. You then read the rest of the data, computing the loglikelihood and score function at each point in the grid. After reading all the data you can then fit a suitable smooth surface to the loglikelihood. I don't know whether this will give sufficient accuracy, though.

For really big data sets you are probably better off with the approach that Brian Ripley and Fei Chen used -- they have shown that it works, and there is unlikely to be anything much simpler that also works that they missed.


Thomas Lumley
Assoc. Professor, Biostatistics
University of Washington, Seattle

Received on Thu Aug 24 03:12:14 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
