# Re: [R] Maximum number of variables allowed in a multiple linear regression model

From: Tony Plate <tplate_at_acm.org>
Date: Wed, 06 Feb 2008 10:28:34 -0700

Bert Gunter wrote:
> I strongly suggest you collaborate with a local statistician. I can think of
> no circumstance where multiple regression on "hundreds of thousands of
> variables" is anything more than a fancy random number generator.

That sounds like a challenge! What is the largest regression problem (in terms of numbers of variables) that people have encountered where it made sense to do some sort of linear regression (and gave useful results)? (Including multilevel and Bayesian techniques.)

R version 2.6.1 (2007-11-26)
Copyright (C) 2007 The R Foundation for Statistical Computing ISBN 3-900051-07-0

```
> m <- 3000
> n <- m * 10
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> dim(x)
[1] 30000 3000
> k <- sample(m, 10)
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
Error: cannot allocate vector of size 686.6 Mb
> object.size(x)/2^20
[1] 687.7787
> memory.size()
[1] -2022.552
>
```
and the Windows process monitor shows the peak memory usage for Rgui.exe at 2,137,923K. But in a 64-bit version of R, I would be surprised if it was not possible to run this (given sufficient memory).
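As a rough check on those numbers, the size in the error message is what one more dense copy of x would need at 8 bytes per double, presumably a working copy made inside lm.fit. The sketch below is only back-of-envelope arithmetic; the variable names are illustrative, and the slightly larger figure from object.size() is assumed to come from the dimnames.

```r
# Back-of-envelope size of a dense double-precision matrix (illustrative names).
n <- 30000                 # rows, as in the failing example
m <- 3000                  # columns
bytes_per_double <- 8      # R stores numeric values as 8-byte doubles

x_mb <- n * m * bytes_per_double / 2^20   # size of the raw data in Mb (2^20 bytes)
round(x_mb, 1)                            # 686.6, the size named in the error

# Two such copies already need about 1.3 Gb, which leaves little room
# when the process peaks at roughly 2 Gb.
round(2 * x_mb / 1024, 2)
```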

However, R easily handles a slightly smaller problem:

```
> m <- 1000 # of variables
> n <- m * 10 # of rows
> k <- sample(m, 10)
> x <- matrix(rnorm(n*m), ncol=m, nrow=n,
+             dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> y <- rowSums(x[,k]) + 10 * rnorm(n)
> fit <- lm.fit(y=y, x=x)
> # distribution of coefs that should be one vs zero
> round(rbind(one=quantile(fit$coefficients[k]),
+             zero=quantile(fit$coefficients[-k])), digits=2)
        0%   25%   50%  75% 100%
one   0.94  0.98  1.04 1.10 1.18
zero -0.30 -0.08 -0.01 0.06 0.29
>
```
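The spread of the "zero" coefficients is about what sampling noise alone would give. For independent standard-normal predictors and noise standard deviation sigma, each estimate has a standard error of roughly sigma / sqrt(n - m - 1), which is about 0.105 for the run above. The sketch below checks that approximation on a smaller problem so that it runs quickly; the seed, the reduced sizes, and the variable names are assumptions for illustration, not part of the original run.

```r
# Compare the empirical spread of the truly-zero coefficients with the
# approximate theoretical standard error (smaller, seeded version of the example).
set.seed(1)
m <- 200                   # number of variables (reduced for speed)
n <- m * 10                # number of rows
sigma <- 10                # noise standard deviation, as in the original

x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)         # columns whose true coefficient is one
y <- rowSums(x[, k]) + sigma * rnorm(n)
fit <- lm.fit(x = x, y = y)

se_theory    <- sigma / sqrt(n - m - 1)     # approximation for iid N(0,1) predictors
se_empirical <- sd(fit$coefficients[-k])    # spread of estimates whose truth is zero
c(theory = se_theory, empirical = se_empirical)
```

With a standard error near 0.1, the ten true coefficients of one stand out clearly here, but effects much smaller than that would not.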

To echo Bert Gunter's cautions, one must be careful doing ordinary linear regression with large numbers of coefficients. It does seem a little unlikely that there is sufficient data to get useful estimates of three thousand coefficients using linear regression on data managed in Excel (though I guess it could be possible using Excel 12.0, which can handle up to 1 million rows - recent versions prior to 2008 could handle only 64K rows - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the suggestion to consult a local statistician is good advice - there may be other more suitable approaches, and if some form of linear regression is an appropriate approach, there are things to do to gain confidence that the results of the linear regression convey useful information.
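The post does not say which checks it has in mind. One common option, offered here only as an assumption, is to fit on part of the data and compare the fit on the rows that were held out. The sketch reuses the simulated setup at a smaller size; the seed, the 50/50 split, and the variable names are illustrative.

```r
# Hold-out check: in-sample R^2 versus R^2 on rows not used for fitting.
set.seed(2)
m <- 200
n <- m * 10
x <- matrix(rnorm(n * m), nrow = n, ncol = m)
k <- sample(m, 10)
y <- rowSums(x[, k]) + 10 * rnorm(n)

train <- sample(n, n / 2)                       # rows used for fitting
fit   <- lm.fit(x = x[train, ], y = y[train])
pred  <- drop(x[-train, ] %*% fit$coefficients) # predictions for held-out rows

# R^2 relative to predicting the mean of the same subset
r2 <- function(obs, res) 1 - sum(res^2) / sum((obs - mean(obs))^2)
r2_in  <- r2(y[train], fit$residuals)
r2_out <- r2(y[-train], y[-train] - pred)
c(in_sample = r2_in, held_out = r2_out)
```

A held-out R^2 far below the in-sample value suggests the fit is mostly describing noise, which is the situation Bert Gunter warns about.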

Tony Plate

