Re: [R] Maximum number of variables allowed in a multiple linear regression model

From: Tony Plate <tplate_at_acm.org>
Date: Wed, 06 Feb 2008 10:28:34 -0700

Bert Gunter wrote:
> I strongly suggest you collaborate with a local statistician. I can think of
> no circumstance where multiple regression on "hundreds of thousands of
> variables" is anything more than a fancy random number generator.

That sounds like a challenge! What is the largest regression problem (in terms of numbers of variables) that people have encountered where it made sense to do some sort of linear regression (and gave useful results)? (Including multilevel and Bayesian techniques.)

However, the original poster did say "hundreds to thousands", which is far smaller than "hundreds of thousands". When I try a regression problem with 3,000 coefficients in R 2.6.1 under Windows XP 64-bit, on a machine with 8 GB of memory and the /3GB option active (i.e., R can use up to 3 GB), R runs out of memory (apparently while trying to duplicate the model matrix):

R version 2.6.1 (2007-11-26)
Copyright (C) 2007 The R Foundation for Statistical Computing ISBN 3-900051-07-0

 > m <- 3000
 > n <- m * 10
 > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
              dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
 > dim(x)
[1] 30000  3000
 > k <- sample(m, 10)
 > y <- rowSums(x[,k]) + 10 * rnorm(n)
 > fit <- lm.fit(y=y, x=x)
Error: cannot allocate vector of size 686.6 Mb
 > object.size(x)/2^20
[1] 687.7787
 > memory.size()
[1] -2022.552
 >
and the Windows process monitor shows peak memory usage for Rgui.exe of 2,137,923K. With a 64-bit build of R, though, I would be surprised if this problem could not be run (given sufficient memory).
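
As a rough check on the error message, the arithmetic below reproduces the 686.6 Mb figure from the matrix dimensions alone (8 bytes per double, ignoring dimnames and object overhead). The second half is only a sketch of one possible workaround, under the assumption that forming the m x m cross-product is acceptable: it solves the normal equations instead of calling lm.fit(), which should avoid needing a second full copy of x but is less numerically stable than the QR approach. The helper name matrix_mb and the small problem sizes are illustrative choices, not part of the runs above.

```r
# Footprint of a dense double-precision n x m matrix, in Mb (2^20 bytes).
# Ignores the object header and the dimnames.
matrix_mb <- function(n, m) n * m * 8 / 2^20

big   <- matrix_mb(30000, 3000)   # about 686.6, the size in the error message
small <- matrix_mb(10000, 1000)   # the 1,000-variable problem
copies_in_3gb <- floor(3 * 1024 / big)  # whole copies of x that fit in 3 GB

# Normal-equations sketch on a small, assumed problem size (chosen to run
# quickly). crossprod(x) is only m x m, so x itself is not duplicated.
set.seed(1)
n <- 500; m <- 20
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
y <- rowSums(x[, 1:5]) + rnorm(n)

beta_ne <- drop(solve(crossprod(x), crossprod(x, y)))  # solves (X'X) b = X'y
beta_qr <- unname(lm.fit(x = x, y = y)$coefficients)   # QR-based reference
max_diff <- max(abs(beta_ne - beta_qr))                # agreement check
```

For the 3,000-column case the cross-product would be a 3000 x 3000 matrix (about 69 Mb), so only a handful of full-size copies of x fit in a 3 GB address space, whereas the cross-product itself is comparatively small.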

However, R easily handles a slightly smaller problem:

 > m <- 1000 # of variables
 > n <- m * 10 # of rows
 > k <- sample(m, 10)
 > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
              dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
 > y <- rowSums(x[,k]) + 10 * rnorm(n)
 > fit <- lm.fit(y=y, x=x)
 > # distribution of coefs that should be one vs zero
 > round(rbind(one=quantile(fit$coefficients[k]),
               zero=quantile(fit$coefficients[-k])), digits=2)
        0%   25%   50%  75% 100%
one   0.94  0.98  1.04 1.10 1.18
zero -0.30 -0.08 -0.01 0.06 0.29
 >
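
The spread of the "zero" coefficients can be sanity-checked with a little arithmetic. This is an approximation that assumes independent standard-normal predictors, as in the simulation: each coefficient then has a standard error of roughly sigma / sqrt(n - m - 1), and the quartiles of a normal distribution sit about 0.674 standard errors from the centre.

```r
sigma <- 10     # noise standard deviation used in the simulation
n <- 10000      # rows
m <- 1000       # variables

# Approximate standard error of one estimated coefficient
# (assumes iid N(0, 1) predictors).
se <- sigma / sqrt(n - m - 1)

# Approximate quartiles of the estimates whose true value is zero.
quartiles <- qnorm(c(0.25, 0.75)) * se

round(se, 3)         # about 0.105
round(quartiles, 3)  # about -0.071 and 0.071
```

These values are in line with the observed quartiles of -0.08 and 0.06, and the observed extremes of about +/-0.3 are roughly three standard errors out, which is plausible for 990 estimates.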

To echo Bert Gunter's cautions, one must be careful doing ordinary linear regression with large numbers of coefficients. It seems unlikely that there is sufficient data to get useful estimates of three thousand coefficients by linear regression on data managed in Excel (though I guess it could be possible using Excel 12.0, which can handle up to 1 million rows - recent versions prior to 2008 could handle only 64K rows - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So the suggestion to consult a local statistician is good advice: there may be other, more suitable approaches, and if some form of linear regression is appropriate, there are things one can do to gain confidence that its results convey useful information.
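
One example of such a check, offered only as a sketch and not as a substitute for statistical advice, is to hold out some rows and compare the in-sample fit with the fit on rows the model never saw. The problem size, the 1500/500 split, and the helper r2 below are assumptions made for illustration (scaled down from the runs above so the code executes quickly).

```r
set.seed(42)
m <- 200                 # variables (scaled down for speed)
n <- 2000                # rows
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)       # the 10 variables that truly matter
y <- rowSums(x[, k]) + 10 * rnorm(n)

# Fit on the first 1500 rows and keep the remaining 500 for checking.
train <- seq_len(n) <= 1500
fit <- lm.fit(x = x[train, ], y = y[train])

# R-squared of predictions against observations.
r2 <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

pred_in  <- drop(x[train, ]  %*% fit$coefficients)
pred_out <- drop(x[!train, ] %*% fit$coefficients)

r2_in  <- r2(y[train],  pred_in)    # optimistic: same rows used for fitting
r2_out <- r2(y[!train], pred_out)   # held-out rows
```

With many coefficients relative to the number of rows, the held-out R-squared is typically noticeably lower than the in-sample value, and can even be negative when the noise dominates; a large gap between the two is a warning that the fitted coefficients are mostly describing noise.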

>
> -- Bert Gunter
> Genentech Nonclinical Statistics
>
> -----Original Message-----
> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On
> Behalf Of Michelle Chu
> Sent: Tuesday, February 05, 2008 9:00 AM
> To: R-help_at_r-project.org
> Subject: [R] Maximum number of variables allowed in a multiple
> linear regression model
>
> Hi,
>
> I appreciate it if someone can confirm the maximum number of variables
> allowed in a multiple linear regression model. Currently, I am looking for
> a software with the capacity of handling approximately 3,000 variables. I
> am using Excel to process the results. Any information for processing a
> matrix from Excel with hundreds to thousands of variables will be helpful.
>
> Best Regards,
> Michelle
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Wed 06 Feb 2008 - 17:34:10 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Feb 2008 - 03:30:12 GMT.
