From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>

Date: Thu 24 Aug 2006 - 02:15:47 EST

On Wed, 23 Aug 2006, Frank E Harrell Jr wrote:

> Thomas Lumley wrote:
> > On Wed, 23 Aug 2006, Damien Moore wrote:
> >
> > > Thomas Lumley wrote:
> > >
> > > > No, it is quite straightforward if you are willing to make multiple
> > > > passes through the data. It is hard with a single pass and may not be
> > > > possible unless the data are in random order.
> > > >
> > > > Fisher scoring for glms is just an iterative weighted least squares
> > > > calculation using a set of 'working' weights and 'working' response.
> > > > These can be defined chunk by chunk and fed to biglm. Three iterations
> > > > should be sufficient.
> > >
> > > (NB: Although not stated clearly I was referring to single pass when I
> > > wrote "impossible"). Doing as you suggest with multiple passes would
> > > entail either sticking the database input calls into the main iterative
> > > loop of a lookalike glm.fit or lumping the user with a very unattractive
> > > sequence of calls:
> >
> > I have written most of a bigglm() function where the data= argument is a
> > function with a single argument 'reset'. When called with reset=FALSE the
> > function should return another chunk of data, or NULL if no data are
> > available, and when called with reset=TRUE it should go back to the
> > beginning of the data. I don't think this is too inelegant.
> >
> > In general I don't think a one-pass algorithm is possible. If the data are
> > in random order then you could read one chunk, fit a glm, and set up a grid
> > of coefficient values around the estimate. You then read the rest of the
> > data, computing the loglikelihood and score function at each point in the
> > grid. After reading all the data you can then fit a suitable smooth surface
> > to the loglikelihood. I don't know whether this will give sufficient
> > accuracy, though.

Not in general. One of the problems with a binomial/Poisson glm is that the geometry of the likelihood can be radically changed by a single case: suppose that the initial sample were separable? Misclassifications can really get you: one case with an incorrect label can contribute arbitrarily much to the log-likelihood (and I have seen 11,000 units).
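To make the point about a single mislabelled case concrete, here is a small sketch in Python (standard library only; the thread contains no code, so the function name and the chosen values of the linear predictor are purely illustrative). For a logit-link binomial model, a case whose label agrees with a large linear predictor contributes almost nothing to the log-likelihood, while the same case with the label flipped contributes roughly minus the linear predictor, which is unbounded.

```python
import math

def bernoulli_loglik(y, eta):
    """Log-likelihood of one 0/1 response y at linear predictor eta (logit link)."""
    def softplus(t):
        # Numerically stable log(1 + exp(t)).
        return max(t, 0.0) + math.log1p(math.exp(-abs(t)))
    # log p = -log(1 + exp(-eta));  log(1 - p) = -log(1 + exp(eta))
    return -softplus(-eta) if y == 1 else -softplus(eta)

# Illustrative linear predictors: the fitted model is increasingly sure that y = 1.
etas = [2.0, 10.0, 50.0, 500.0]

# Contribution when the label agrees with the model (y = 1): bounded near zero.
correct = [bernoulli_loglik(1, eta) for eta in etas]

# Contribution when the label is wrong (y = 0): approximately -eta, without bound.
mislabelled = [bernoulli_loglik(0, eta) for eta in etas]

for eta, ok, bad in zip(etas, correct, mislabelled):
    print(f"eta={eta:6.1f}  correct label: {ok:10.6f}  wrong label: {bad:10.3f}")
```

A grid of coefficient values built from a first chunk that happened to be separable, or nearly so, would sit in the region of large linear predictors, which is where one such case in the remaining data dominates the total.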

> > For really big data sets you are probably better off with the approach that
> > Brian Ripley and Fei Chen used -- they have shown that it works and there
> > is unlikely to be anything much simpler that also works that they missed.
>
> What I would like to see someone work on is a kind of SQL code generator that
> given a set of weights passes through the database and computes a new weighted
> information matrix. The code generator would make the design matrix a
> symbolic entity. SQL or other suitable framework would return the p x p
> matrix for one iteration at a time.

That is one of the things Fei Chen and I did (using SQL extensions in MySQL), and probably the most successful. We managed to explore models with tens of millions of cases with 30 categorical explanatory vars (and the data structure and problem were such that this was worthwhile, not least because predictions might be needed for 0.01% of the coverage).
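The multi-pass scheme discussed in this thread (one sweep over the data per Fisher-scoring iteration, returning only the p x p weighted information matrix and a p-vector) can be sketched as follows. This is not the MySQL implementation mentioned above, nor biglm's code: it is a minimal pure-Python illustration for a logistic model. The names make_source, solve and chunked_logistic_irls, the simulated data, and the default of 8 passes are all assumptions made for the example. The data source follows the 'reset' convention described for bigglm().

```python
import math
import random

def make_source(rows, chunk_size):
    """Hypothetical chunked data source: reset=True rewinds; reset=False
    returns the next chunk, or None when no data are left."""
    pos = [0]
    def source(reset=False):
        if reset:
            pos[0] = 0
            return None
        if pos[0] >= len(rows):
            return None
        chunk = rows[pos[0]:pos[0] + chunk_size]
        pos[0] += chunk_size
        return chunk
    return source

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def chunked_logistic_irls(source, p, passes=8):
    """Fisher scoring for a logit model, one full pass over the data per iteration.
    Only X'WX (p x p) and X'Wz (length p) are kept in memory between chunks.
    Sketch only: no guard against working weights underflowing to zero."""
    beta = [0.0] * p
    for _ in range(passes):
        xtwx = [[0.0] * p for _ in range(p)]
        xtwz = [0.0] * p
        source(reset=True)
        while True:
            chunk = source(reset=False)
            if chunk is None:
                break
            for x, y in chunk:
                eta = sum(b * xi for b, xi in zip(beta, x))
                mu = 1.0 / (1.0 + math.exp(-eta))
                w = mu * (1.0 - mu)        # 'working' weight
                z = eta + (y - mu) / w     # 'working' response
                for i in range(p):
                    xtwz[i] += x[i] * w * z
                    for j in range(p):
                        xtwx[i][j] += x[i] * w * x[j]
        beta = solve(xtwx, xtwz)           # weighted least squares update
    return beta

# Simulated data (an assumption for the demo): intercept plus one covariate.
random.seed(1)
true_beta = [-0.5, 1.0]
rows = []
for _ in range(2000):
    x = [1.0, random.gauss(0.0, 1.0)]
    p_i = 1.0 / (1.0 + math.exp(-(true_beta[0] + true_beta[1] * x[1])))
    rows.append((x, 1 if random.random() < p_i else 0))

beta_chunked = chunked_logistic_irls(make_source(rows, 100), p=2)
beta_one_chunk = chunked_logistic_irls(make_source(rows, len(rows)), p=2)
print(beta_chunked, beta_one_chunk)
```

The chunk size affects only how much data is held at once, not the estimate, because each pass accumulates the same sums. In a database setting the inner accumulation is the part that would be pushed into the query, so that only the p x p matrix and p-vector come back per iteration.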

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu Aug 24 02:30:52 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.1.8, at Thu 24 Aug 2006 - 04:22:28 EST.
