Re: [R] Re : Large database help

From: Greg Snow <>
Date: Thu 18 May 2006 - 01:47:01 EST

Thanks for doing this Thomas, I have been thinking about what it would take to do this, but if it were left to me, it would have taken a lot longer.

Back in the 80's there was a statistical package called RUMMAGE that did all computations based on sufficient statistics and did not keep the actual data in memory. Memory for computers became cheap before datasets turned huge so there wasn't much demand for the program (and it never had a nice GUI to help make it popular). It looks like things are switching back to that model now though.
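
To make the sufficient-statistics idea concrete, here is a minimal base R sketch for least squares. It is not how RUMMAGE or biglm is implemented; it uses the simpler cross-product accumulation. The built-in trees data and the three-way chunking are only stand-ins for data arriving in pieces, and the formula is borrowed from Thomas's example quoted below.

```r
# Sketch: least squares from accumulated sufficient statistics (X'X and X'y).
# The chunking of the built-in 'trees' data is made up for illustration.
chunks <- split(trees, rep(1:3, length.out = nrow(trees)))

xtx <- 0
xty <- 0
for (chunk in chunks) {
  X <- model.matrix(~ log(Girth) + log(Height), data = chunk)
  y <- log(chunk$Volume)
  xtx <- xtx + crossprod(X)      # running X'X
  xty <- xty + crossprod(X, y)   # running X'y
}

# Solve the normal equations using only the accumulated totals
beta_chunked <- drop(solve(xtx, xty))

# Reference fit with all of the data in memory at once
beta_full <- coef(lm(log(Volume) ~ log(Girth) + log(Height), data = trees))
```

Only the small xtx and xty objects stay in memory between chunks. Cross-products can lose precision on badly conditioned problems, which is one reason to prefer a QR-based update.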

Here are a couple of thoughts I had that might help with future development:

Another function that could be helpful is bigplot, which I imagine would be best based on the hexbin package, accumulating the bin counts chunk by chunk in the same way your biglm function works through the data. Once I see the code for biglm I may be able to contribute this piece. I guess bigbarplot and bigboxplot may also be useful. Accumulating counts for the barplot will be easy, but does anyone have ideas on the best way to get quantiles for the boxplots efficiently? The best approach I can think of so far is to have the database sort the variables, but sorting tends to be slow.
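
The barplot case could be sketched roughly as follows in base R. The name chunks and the use of mtcars are just placeholders for pieces fetched from a database, and I am assuming the category levels are known before the first chunk is read.

```r
# Sketch: accumulate category counts chunk by chunk.
# 'chunks' stands in for pieces fetched from a database; mtcars is sample data.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))
lev <- c("4", "6", "8")   # category levels, assumed known in advance

counts <- setNames(integer(length(lev)), lev)
for (chunk in chunks) {
  # Fixing the levels keeps every chunk's table in the same order,
  # even when a level is absent from that chunk
  tab <- table(factor(chunk$cyl, levels = lev))
  counts <- counts + as.vector(tab)
}
```

The accumulated vector can then be handed to barplot(counts). The same pattern should carry over to two-dimensional bin counts, provided the bin boundaries are fixed before the first chunk is read.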

Another general approach that I thought of would be to read the data in chunks, compute the statistic(s) of interest on each chunk (e.g. the vector of coefficients for a regression model), then average the estimates across chunks. Each chunk could be treated as a cluster in a cluster sample, both for averaging the estimates and for estimating their variances (if only we can get the author of the survey package involved :-). This would probably be less accurate than your biglm function for regression, but it would have the flavor of the bootstrapping routines in that it would work for many cases that don't have their own big methods written yet (logistic and other glm models, correlations, ...).
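
A rough base R sketch of the averaging idea, under some loud assumptions: the chunks are random, roughly equal-sized pieces of the data, the standard error is the naive between-chunk one rather than anything from the survey package, and the trees data with three chunks is far too small to be more than an illustration.

```r
# Sketch: fit the model on each chunk, average the coefficients, and use the
# between-chunk spread for standard errors (chunks treated as equal-sized clusters).
set.seed(1)
idx <- sample(rep(1:3, length.out = nrow(trees)))   # illustrative random chunking
chunks <- split(trees, idx)

# One column of coefficients per chunk
est <- sapply(chunks, function(chunk)
  coef(lm(log(Volume) ~ log(Girth) + log(Height), data = chunk)))

k   <- ncol(est)                        # number of chunks
avg <- rowMeans(est)                    # averaged estimate
se  <- apply(est, 1, sd) / sqrt(k)      # naive standard error of the average
result <- cbind(estimate = avg, std.error = se)
```

Replacing the lm call with glm(..., family = binomial), or with any other function that returns a vector of estimates, leaves the rest unchanged, which is the appeal of the approach. The averaged estimate will generally not equal the full-data fit.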

Any other thoughts anyone?

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
(801) 408-8111

-----Original Message-----
[] On Behalf Of Thomas Lumley
Sent: Tuesday, May 16, 2006 3:40 PM
To: roger koenker
Cc: r-help list; Robert Citek
Subject: Re: [R] Re : Large database help

On Tue, 16 May 2006, roger koenker wrote:

> In ancient times, 1999 or so, Alvaro Novo and I experimented with an
> interface to mysql that brought chunks of data into R and accumulated
> results.
> This is still described and available on the web in its original form
> at
> Despite claims of "future developments" nothing emerged, so anyone
> considering further explorations with it may need training in
> Rchaeology.

A few hours ago I submitted to CRAN a package "biglm" that does large linear regression models using a similar strategy (it uses incremental QR decomposition rather than accumulating the crossproduct matrix). It also computes the Huber/White sandwich variance estimate in the same single pass over the data.

Assuming I haven't messed up the package checking it will appear in the next couple of days on CRAN.

The syntax looks like

    a <- biglm(log(Volume) ~ log(Girth) + log(Height), chunk1)
    a <- update(a, chunk2)
    a <- update(a, chunk3)
    summary(a)

where chunk1, chunk2, chunk3 are chunks of the data.

    -thomas

______________________________________________
mailing list
PLEASE do read the posting guide!

Received on Thu May 18 01:59:52 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 18 May 2006 - 02:10:09 EST.

Mailing list information is available at Please read the posting guide before posting to the list.