Re: [R] Can we do GLM on 2GB data set with R?

From: Prof Brian Ripley
Date: Sun 21 Jan 2007 - 09:13:57 GMT

'given sufficient hardware' and a suitable OS, 'yes'.

You will see quoted on this list from time to time:

> library(fortunes)
> fortune("Yoda")

Evelyn Hall: I would like to know how (if) I can extract some of the information from the summary of my nlme.
Simon Blomberg: This is R. There is no if. Only how.

You then mention 'my PC'. If your 'PC' is running Windows, the answer is 'with some work', since we don't have a version of R for Win64 and Win32 is limited to 3GB user address space.

To be more precise, we would have to know more about your GLM (and even if you mean GLM in the commonly accepted sense or the SASism with a redundant G), including what the variables are (I guess categorical stored as small integers?)
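As a rough way to gauge memory needs before any trial and error, one can measure the size of a single column and scale up. The sketch below assumes hypothetical dimensions (1 million rows, 30 variables). It relies on the fact that R stores integers in 4 bytes each, however small the values are, and that a model matrix is stored as doubles at 8 bytes each. Fitting with glm() will need several times the model-matrix figure for working copies, so treat the result as a lower bound.

```r
# Hypothetical dimensions, for illustration only.
n_rows <- 1e6
n_cols <- 30

# Measured size of one column of each storage type (includes a small header).
bytes_int <- as.numeric(object.size(integer(n_rows)))  # 4 bytes per element
bytes_dbl <- as.numeric(object.size(numeric(n_rows)))  # 8 bytes per element

# Data frame of integer-coded variables.
est_data_frame_mb <- n_cols * bytes_int / 2^20

# Main-effects model matrix (intercept + one double column per variable).
# Interactions add further columns, so real problems are larger than this.
est_model_matrix_mb <- (n_cols + 1) * bytes_dbl / 2^20

cat("data frame (MB):   ", round(est_data_frame_mb), "\n")
cat("model matrix (MB): ", round(est_model_matrix_mb), "\n")
```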

My DPhil student Fei Chen looked at ways of applying R to large GLMs with data stored in a MySQL database. His tests were about 4 years ago on 32-bit Linux, and he was able to run about 1 million cases on 30 categorical (mainly binary) variables with (I think) up to 5-way interactions. That is a very large GLM problem, and it is unusual for it to be worth fitting a (mainly linear) model with over 10,000 cases. (Also, there are normally problems with the homogeneity of very large datasets that taint the independence assumptions made by GLMs.)

My guess is that you have been considering the function glm(). There is also a function bigglm() in package biglm (by Thomas Lumley). I don't think you would even be able to load your data into 32-bit R, but it would be possible to use the ideas behind bigglm (which was one of the approaches Fei assessed), and perhaps even bigglm itself, with one of the DBMS interfaces to R to retrieve data in chunks. (bigglm uses chunks of rows, but chunks of columns may be more efficient.)
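To illustrate the general idea of fitting a GLM from chunks of rows, here is a minimal base-R sketch. It is not bigglm's own algorithm (which, as far as I know, updates a QR decomposition incrementally). It simply accumulates the weighted cross-products X'WX and X'Wz over chunks within each IRLS iteration for a logistic model. The names make_reader() and chunked_logit() are hypothetical, and the reader here slices a simulated data frame where a real one would fetch rows from a DBMS.

```r
# Simulated stand-in for a large table of binary predictors.
set.seed(1)
n <- 5000
sim <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.3))
sim$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * sim$x1 - 0.4 * sim$x2))

# Hypothetical chunk reader: returns the next block of rows, or NULL when
# exhausted; reset = TRUE rewinds to the start for another pass.
make_reader <- function(data, chunk_size) {
  pos <- 0L
  function(reset = FALSE) {
    if (reset) { pos <<- 0L; return(invisible(NULL)) }
    if (pos >= nrow(data)) return(NULL)
    idx <- (pos + 1L):min(pos + chunk_size, nrow(data))
    pos <<- max(idx)
    data[idx, , drop = FALSE]
  }
}

# Logistic regression by IRLS, one pass over the chunks per iteration.
# Assumes every chunk yields the same model-matrix columns (true for numeric
# 0/1 predictors; factors would need their levels fixed in advance).
chunked_logit <- function(formula, reader, max_iter = 25, tol = 1e-8) {
  beta <- NULL
  for (iter in seq_len(max_iter)) {
    reader(reset = TRUE)
    XtWX <- 0
    XtWz <- 0
    while (!is.null(chunk <- reader())) {
      mf <- model.frame(formula, chunk)
      X <- model.matrix(formula, mf)
      y <- model.response(mf)
      if (is.null(beta)) beta <- rep(0, ncol(X))
      eta <- drop(X %*% beta)
      mu <- plogis(eta)
      w <- mu * (1 - mu)            # IRLS weights for the logit link
      z <- eta + (y - mu) / w       # working response
      XtWX <- XtWX + crossprod(X, w * X)
      XtWz <- XtWz + crossprod(X, w * z)
    }
    new_beta <- drop(solve(XtWX, XtWz))
    converged <- max(abs(new_beta - beta)) < tol
    beta <- new_beta
    if (converged) break
  }
  beta
}

fit <- chunked_logit(y ~ x1 + x2, make_reader(sim, chunk_size = 1000))
ref <- coef(glm(y ~ x1 + x2, family = binomial, data = sim))
print(rbind(chunked = fit, glm = ref))
```

Only one chunk and two small p-by-p accumulators are held at a time, so memory use depends on the chunk size and the number of coefficients rather than on the number of cases. The price is one full pass over the data per iteration.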

Another possibility is that you want to fit a log-linear model to purely categorical data, and could make use of loglin(). That will be more efficient if the contingency table is densely populated.
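A small sketch of the loglin() route, using simulated data with three hypothetical binary factors a, b and c. The cases are first collapsed into a contingency table, so the fit works on the cell counts (8 cells here) rather than on the individual rows.

```r
# Simulated categorical data; variable names are illustrative only.
set.seed(2)
n <- 100000
cases <- data.frame(a = rbinom(n, 1, 0.5),
                    b = rbinom(n, 1, 0.3),
                    c = rbinom(n, 1, 0.6))

# Collapse the cases into a 2 x 2 x 2 table of counts.
tab <- table(cases)

# Fit the model with all two-way interactions by iterative proportional
# fitting; margins are given as lists of the table dimensions to be matched.
fit <- loglin(tab, margin = list(c(1, 2), c(1, 3), c(2, 3)),
              fit = TRUE, print = FALSE)

cat("LR statistic:", fit$lrt, " df:", fit$df, "\n")
```

The table grows with the product of the numbers of levels, not with the number of cases, which is why this pays off only when most cells are occupied.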

My experience suggests that the important issues here are likely to be statistical rather than computational, and this is more a topic for a consultant than volunteer help on a discussion list.

On Sat, 20 Jan 2007, WILLIE, JILL wrote:

> We are wanting to use R instead of/in addition to our existing stats
> package because of its huge assortment of stat functions. But, we
> routinely need to fit GLM models to files that are approximately 2-4GB
> (as SQL tables, un-indexed, w/tinyint-sized fields except for the
> response & weight variables). Is this feasible, does anybody know,
> given sufficient hardware, using R? It appears to use a great deal of
> memory on the small files I've tested.
> I've read the data import, memory.limit, memory.size & general
> documentation but can't seem to find a way to tell what the boundaries
> are & roughly gauge the needed memory...other than trial & error. I've
> started by testing the data.frame & run out of memory on my PC. I'm new
> to R so please be forgiving if this is a poorly-worded question.
> Jill Willie
> Open Seas
> Safeco Insurance
> 206-545-5673

Brian D. Ripley,        
Professor of Applied Statistics,
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Sun Jan 21 20:19:36 2007
