From: Greg Snow <Greg.Snow_at_intermountainmail.org>

Date: Tue 22 Aug 2006 - 04:01:06 EST

For very large regression problems there is the biglm package (put your
data into a database, read it in 500,000 rows at a time, and keep
updating the fit).

This has not been extended to glm yet.
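A minimal sketch of the chunked-update pattern described above (assuming the biglm package is installed; the data, chunk size, and the make_chunk helper are made up for illustration -- in practice each chunk would come from a database query):

```r
library(biglm)

# simulate chunks of a large dataset; in real use each call would
# instead fetch the next 500,000 rows from a database
set.seed(1)
make_chunk <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = 3 * x + rnorm(n))
}

# fit on the first chunk, then fold in later chunks one at a time;
# only the running summaries are kept in memory, not the raw rows
fit <- biglm(y ~ x, data = make_chunk(500000))
fit <- update(fit, make_chunk(500000))
coef(fit)
```

Because update() only accumulates the cross-product summaries, memory use stays roughly constant no matter how many chunks are processed.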

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@intermountainmail.org
(801) 408-8111

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Damien Moore
Sent: Monday, August 21, 2006 11:49 AM
To: r-help@stat.math.ethz.ch
Subject: [R] lean and mean lm/glm?

Hi All:

I'm new to R and have a few questions about getting R to run efficiently with large datasets. I'm running R on Windows XP with 1 GB of RAM (so about 600-700 MB after the usual Windows overhead). I have a dataset that has 4 million observations and about 20 variables, and I want to run probit regressions on this data, but I can't do this with more than about 500,000 observations before I start running out of RAM. (You could argue that I'm getting sufficient precision with fewer than 500,000 observations, but let's pretend otherwise.)

Loading 500,000 observations into a data frame takes only about 100 MB of RAM, so that isn't the problem. Instead, R seems to use a huge amount of memory when running the glm methods. I called the Fortran routines that lm and glm use directly, but even they create a large number of extraneous variables in the output (e.g. the Xs, ys, residuals, etc.) and during processing. For instance (sample code):

x = runif(1000000)
# I notice this step chews up a lot more than the 7 MB of RAM required
# to store y during processing, but cleans up OK afterwards with a
# gc() call
y = 3*x + rnorm(1000000)
X = cbind(x)
p = ncol(X)
n = NROW(y)
ny = NCOL(y)
tol = 1e-7
# this is the Fortran routine called by lm -- regressing y on X here
z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
              tol = as.double(tol), coefficients = mat.or.vec(p, ny),
              residuals = y, effects = y, rank = integer(1),
              pivot = 1:p, qraux = double(p), work = double(2 * p),
              PACKAGE = "base")

This code runs very quickly, suggesting that in principle R should have no problem at all handling very large data sets, but it uses more than 100 MB during processing, and z is about a 20 MB object. Scaling this up to a much larger dataset with many variables, it's easy to see I'm going to run into problems.

My questions:

1. Are there any memory-efficient alternatives to lm/glm in R?

2. Is there any way to prevent the Fortran routine "dqrls" from producing so much output? (I suspect not, since its output has to be compatible with the summary method, which seems to rely on having a copy of all variables instead of just references to the relevant variables -- correct me if I'm wrong on this.)

3. Failing 1 and 2, how easy would it be to create new versions of lm and glm that don't use so much memory? (Not that I'm volunteering or anything ;) .) There is no need to hold individual residuals in memory or make copies of the variables (at least for my purposes). How well documented is the source code?

cheers
Damien Moore

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Tue Aug 22 05:42:32 2006
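Not part of the original thread, but as an illustration of question 3: base R's lm.fit already skips the model frame and terms objects that lm carries around, and a caller who needs only coefficients can discard the rest of its return value immediately. A sketch under those assumptions (lean_lm is a hypothetical helper, not an existing function):

```r
# hypothetical helper: a least-squares fit that keeps only the
# coefficients, discarding residuals, effects and fitted values
lean_lm <- function(X, y) {
  fit <- lm.fit(X, y)      # QR-based fit, no model frame or call stored
  coefs <- fit$coefficients
  rm(fit)                  # drop the n-length residuals/effects vectors
  gc()                     # return the memory to the pool right away
  coefs
}

x <- runif(100000)
X <- cbind(Intercept = 1, x = x)
y <- 3 * x + rnorm(100000)
lean_lm(X, y)              # coefficient on x should be near 3
```

This does not help with the transient memory used inside the fit itself, but it keeps the persistent footprint to p numbers rather than several n-length vectors.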

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.1.8, at Tue 22 Aug 2006 - 08:22:42 EST.
