Re: [R] lean and mean lm/glm?

From: Damien Moore <damien.moore_at_excite.com>
Date: Thu 24 Aug 2006 - 01:06:43 EST

Thomas Lumley wrote:

> No, it is quite straightforward if you are willing to make multiple passes
> through the data. It is hard with a single pass and may not be possible
> unless the data are in random order.
>
> Fisher scoring for glms is just an iterative weighted least squares
> calculation using a set of 'working' weights and 'working' response. These
> can be defined chunk by chunk and fed to biglm. Three iterations should
> be sufficient.
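For concreteness, here is a rough sketch of what I understand Thomas to be suggesting (my code, not his, and the toy data and in-memory "chunks" are for illustration only — in practice each chunk would come from the database). Each pass recomputes the working response and working weights chunk by chunk from the current coefficients and feeds them to biglm/update.biglm:

```r
library(biglm)

## toy data, pre-split into ten chunks to stand in for database reads
set.seed(1)
n <- 10000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1 + 0.5 * d$x1 - d$x2))
chunks <- split(d, rep(1:10, each = n / 10))

fam  <- binomial()
beta <- c(0, 0, 0)                    # intercept, x1, x2

for (iter in 1:3) {                   # three passes, per Thomas's note
  fit <- NULL
  for (ch in chunks) {
    X    <- model.matrix(~ x1 + x2, ch)
    eta  <- drop(X %*% beta)
    mu   <- fam$linkinv(eta)
    mm   <- fam$mu.eta(eta)
    ch$z <- eta + (ch$y - mu) / mm    # working response
    ch$w <- mm^2 / fam$variance(mu)   # working weights
    fit  <- if (is.null(fit))
      biglm(z ~ x1 + x2, data = ch, weights = ~w)
    else
      update(fit, ch)
  }
  beta <- coef(fit)
}
## beta now approximates coef(glm(y ~ x1 + x2, binomial, data = d))
```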

(NB: although not stated clearly, I was referring to a single pass when I wrote "impossible".) Doing as you suggest with multiple passes would entail either sticking the database input calls into the main iterative loop of a glm.fit lookalike, or lumping the user with a very unattractive sequence of calls:

big_glm.init
iterate:
    load_data_chunk
    big_glm.newiter
    iterate:    # could use a subset of the chunks on the first few go-rounds
        load_data_chunk
        update.big_glm
    big_glm.check_convergence    # would also need to do coefficient adjustments if convergence is failing

Because most (if not all) of my data fits into memory anyway, I propose simply doing the calculations in a modified glm.fit in chunks (i.e. by subsetting the X and y matrices within the loops) with a user-defined chunk length. I can always add database input calls later to handle exceptionally large datasets.
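To show what I mean by "in chunks", here is a minimal sketch (again mine, not the actual modified glm.fit): within each Fisher-scoring iteration, accumulate the cross-products t(X) %*% W %*% X and t(X) %*% W %*% z over row blocks of a user-defined chunk length, so the only full-length temporaries are the data themselves:

```r
chunked_glm <- function(X, y, family = binomial(),
                        chunk.len = 1000L, maxit = 10L) {
  p <- ncol(X); n <- nrow(X)
  beta <- numeric(p)
  starts <- seq(1L, n, by = chunk.len)
  for (iter in seq_len(maxit)) {
    XtWX <- matrix(0, p, p)
    XtWz <- numeric(p)
    for (s in starts) {
      idx <- s:min(s + chunk.len - 1L, n)
      Xi  <- X[idx, , drop = FALSE]
      eta <- drop(Xi %*% beta)
      mu  <- family$linkinv(eta)
      mm  <- family$mu.eta(eta)
      w   <- mm^2 / family$variance(mu)   # working weights
      z   <- eta + (y[idx] - mu) / mm     # working response
      XtWX <- XtWX + crossprod(Xi, Xi * w)
      XtWz <- XtWz + drop(crossprod(Xi, w * z))
    }
    beta.new <- solve(XtWX, XtWz)
    if (max(abs(beta.new - beta)) < 1e-8) { beta <- beta.new; break }
    beta <- beta.new
  }
  beta
}
```

The per-iteration workspace is then O(chunk.len * p) plus a p-by-p matrix, regardless of n, which is where the memory savings over stock glm.fit come from.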

If one of you has a better suggestion, I'm willing to hear it.

So far, I have hacked out a lot of the (in my view) extraneous stuff from glm and halved its memory usage. I can now run a 12-variable, 1-million-observation data set using "only" 200Mb of working memory (excluding the memory required to store the data). Previously glm.fit was using 500Mb to do the same. Convergence took 9 iterations either way. To reiterate: the inefficiency is in calculating the estimates, not in storing the data.



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Received on Thu Aug 24 01:13:16 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 24 Aug 2006 - 04:21:53 EST.
