# Re: [R] Maximum number of variables allowed in a multiple linear regression model

From: Douglas Bates <bates_at_stat.wisc.edu>
Date: Wed, 6 Feb 2008 19:53:42 -0600

On Feb 6, 2008 11:28 AM, Tony Plate <tplate_at_acm.org> wrote:
> Bert Gunter wrote:
> > I strongly suggest you collaborate with a local statistician. I can think of
> > no circumstance where multiple regression on "hundreds of thousands of
> > variables" is anything more than a fancy random number generator.
>
> That sounds like a challenge! What is the largest regression problem (in
> terms of numbers of variables) that people have encountered where it made
> sense to do some sort of linear regression (and gave useful results)?
> (Including multilevel and Bayesian techniques.)

I have fit linear and generalized linear models with hundreds of thousands of coefficients but, of course, with a highly structured model matrix and using sparse matrix techniques. What is called the Rasch model for analysis of item response data (e.g. correct/incorrect responses by students to the items on a multiple-choice test) is a generalized linear model with the students and the items as factors.
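
To make the "highly structured model matrix" point concrete, here is a minimal sketch of how the model matrix for a Rasch-type model looks when stored sparsely. The sizes, the simulated data, and the variable names (`n_students`, `n_items`, `d`, `X`) are assumptions made up for illustration; only `sparse.model.matrix()` and `nnzero()` from the Matrix package are real API. It shows the structure only and is not the fitting code used for the models mentioned above.

```r
library(Matrix)  # recommended package that ships with R

set.seed(1)
n_students <- 200   # hypothetical sizes, chosen only for illustration
n_items    <- 30

# One row per (student, item) response; both are factors
d <- expand.grid(student = factor(seq_len(n_students)),
                 item    = factor(seq_len(n_items)))

# Simulate correct/incorrect responses from a Rasch-type model:
# logit P(correct) = ability[student] - difficulty[item]
ability    <- rnorm(n_students)
difficulty <- rnorm(n_items)
eta <- ability[as.integer(d$student)] - difficulty[as.integer(d$item)]
d$correct <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Sparse model matrix: intercept + student dummies + item dummies
X <- sparse.model.matrix(~ student + item, data = d)

dim(X)                     # rows = responses, columns = coefficients
max(rowSums(X != 0))       # no row has more than 3 nonzero entries
nnzero(X) / prod(dim(X))   # fraction of entries that are nonzero
```

With treatment contrasts each row carries only the intercept, at most one student indicator and at most one item indicator, so the number of stored values grows with the number of responses rather than with responses times coefficients. That is what keeps hundreds of thousands of coefficients tractable.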

However, like Bert I would be very dubious of any attempt to fit a linear regression model to 3000 variables that were not generated in a systematic way. Sounds like a massive, computer-fueled fishing expedition (a.k.a. "data mining").

> However, the original poster did say "hundreds to thousands", which is
> smaller than "hundreds of thousands". When I try a regression problem with
> 3,000 coefficients in R running under Windows XP 64 bit with 8Gb of memory
> on the machine and the /3Gb option active (i.e., R can get up to 3Gb), R
> 2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
>
> R version 2.6.1 (2007-11-26)
> Copyright (C) 2007 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
> > m <- 3000
> > n <- m * 10
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
> +     dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > dim(x)
> [1] 30000  3000
> > k <- sample(m, 10)
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> Error: cannot allocate vector of size 686.6 Mb
> > object.size(x)/2^20
> [1] 687.7787
> > memory.size()
> [1] -2022.552
> >
> and the Windows process monitor shows the peak memory usage for Rgui.exe at
> 2,137,923K. But in a 64 bit version of R, I would be surprised if it was
> not possible to run this (given sufficient memory).
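
The size in the error message can be checked with a line of arithmetic. The sketch below assumes only that the matrix is stored as 8-byte doubles; the name `copy_mb` is made up for illustration.

```r
n <- 30000
m <- 3000
bytes_per_double <- 8

# Size of one n-by-m block of doubles, in Mb (2^20 bytes)
copy_mb <- n * m * bytes_per_double / 2^20
copy_mb   # about 686.6, the size in the "cannot allocate" message
```

So the failed allocation is one more full-size copy of the data in x. The slightly larger object.size() figure presumably reflects the dimnames stored with x. A peak of about 2.1 Gb is roughly three such blocks, which would fit the guess that the model matrix is being duplicated, though that is an inference from the numbers and not something the transcript proves.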
>
> However, R easily handles a slightly smaller problem:
> > m <- 1000 # of variables
> > n <- m * 10 # of rows
> > k <- sample(m, 10)
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
> +     dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> > # distribution of coefs that should be one vs zero
> > round(rbind(one=quantile(fit$coefficients[k]),
> +     zero=quantile(fit$coefficients[-k])), digits=2)
>         0%   25%   50%  75% 100%
> one   0.94  0.98  1.04 1.10 1.18
> zero -0.30 -0.08 -0.01 0.06 0.29
> >
>
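
If the duplicated model matrix is what exhausts memory, one alternative is to work from the m-by-m cross-product instead of the n-by-m matrix. The sketch below is a swapped-in technique, not what lm.fit() does: it solves the normal equations through a Cholesky factor and checks the result against lm.fit() on a problem small enough to run quickly. The sizes and seed are assumptions for illustration. Forming X'X squares the condition number, so this is less numerically stable than the QR decomposition and is only reasonable for well-conditioned data such as this simulation.

```r
set.seed(42)
m <- 200            # smaller than in the post so it runs in a moment
n <- m * 10
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)
y <- rowSums(x[, k]) + 10 * rnorm(n)

# Normal equations: (X'X) b = X'y, solved through a Cholesky factor.
# The largest new object is the m x m matrix X'X.
xtx <- crossprod(x)       # X'X
xty <- crossprod(x, y)    # X'y
R <- chol(xtx)            # upper triangular, with xtx = R'R
b <- backsolve(R, forwardsolve(t(R), xty))

# Compare with the QR-based lm.fit() used in the post
fit <- lm.fit(x = x, y = y)
max(abs(b - fit$coefficients))   # should be tiny for well-conditioned x
```

For m = 3000 the cross-product is 3000 x 3000 doubles, about 69 Mb, against about 687 Mb for each copy of x.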
> To echo Bert Gunter's cautions, one must be careful doing ordinary linear
> regression with large numbers of coefficients. It does seem a little
> unlikely that there is sufficient data to get useful estimates of three
> thousand coefficients using linear regression in data managed in Excel
> (though I guess it could be possible using Excel 12.0, which can handle up
> to 1 million rows - recent versions prior to 2008 could handle only 64K rows
> - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the
> suggestion to consult a local statistician is good advice - there may be
> other more suitable approaches, and if some form of linear regression is an
> appropriate approach, there are things to do to gain confidence that the
> results of the linear regression convey useful information.
>
> -- Tony Plate
>
>
> >
> > -- Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On
> > Behalf Of Michelle Chu
> > Sent: Tuesday, February 05, 2008 9:00 AM
> > To: R-help_at_r-project.org
> > Subject: [R] Maximum number of variables allowed in a multiple
> > linear regression model
> >
> > Hi,
> >
> > I would appreciate it if someone could confirm the maximum number of
> > variables allowed in a multiple linear regression model. Currently, I am
> > looking for software capable of handling approximately 3,000 variables. I
> > am using Excel to process the results. Any information on processing a
> > matrix from Excel with hundreds to thousands of variables would be helpful.
> >
> > Best Regards,
> > Michelle
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > and provide commented, minimal, self-contained, reproducible code.