From: Douglas Bates <bates_at_stat.wisc.edu>

Date: Wed, 6 Feb 2008 19:53:42 -0600

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 07 Feb 2008 - 01:59:10 GMT

On Feb 6, 2008 11:28 AM, Tony Plate <tplate_at_acm.org> wrote:

> Bert Gunter wrote:
> > I strongly suggest you collaborate with a local statistician. I can think of
> > no circumstance where multiple regression on "hundreds of thousands of
> > variables" is anything more than a fancy random number generator.
>
> That sounds like a challenge! What is the largest regression problem (in
> terms of numbers of variables) that people have encountered where it made
> sense to do some sort of linear regression (and gave useful results)?
> (Including multilevel and Bayesian techniques.)

I have fit linear and generalized linear models with hundreds of thousands of coefficients but, of course, with a highly structured model matrix and using sparse matrix techniques. What is called the Rasch model for analysis of item response data (e.g. correct/incorrect responses by students to the items on a multiple-choice test) is a generalized linear model with the students and the items as factors.
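
As a rough illustration of why a model matrix with two crossed factors is "highly structured", the sketch below builds the indicator columns as a sparse matrix and compares its size with the dense equivalent. It assumes the Matrix package (a recommended package distributed with R); the sizes and the names `n_students`, `n_items`, and `d` are made up for illustration, and this is not the code used for the fits described above.

```r
library(Matrix)  # provides sparse.model.matrix() and nnzero()

# Hypothetical sizes, chosen only for illustration
n_students <- 2000
n_items    <- 50

# One row per (student, item) response
d <- expand.grid(student = factor(seq_len(n_students)),
                 item    = factor(seq_len(n_items)))

# Intercept plus treatment-contrast indicators for both factors.
# Each row has at most three non-zero entries.
X <- sparse.model.matrix(~ student + item, data = d)

dim(X)                              # rows = responses, columns = coefficients
nnzero(X) / prod(dim(X))            # fraction of entries that are non-zero

# Memory actually used versus what a dense double matrix would need (in Mb)
sparse_mb <- as.numeric(object.size(X)) / 2^20
dense_mb  <- prod(dim(X)) * 8 / 2^20
c(sparse = sparse_mb, dense = dense_mb)
```

The dense size is only computed, never allocated, so the sketch runs comfortably on a small machine.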

However, like Bert I would be very dubious of any attempt to fit a linear regression model to 3000 variables that were not generated in a systematic way. Sounds like a massive, computer-fueled fishing expedition (a.k.a. "data mining").
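
A small simulation makes the "fishing expedition" concern concrete. The sketch below regresses a pure-noise response on pure-noise predictors and counts how many coefficients look "significant" at the 5% level; the sizes `n` and `p` and the seed are arbitrary assumptions. On average about 5% of unrelated predictors pass such a test by chance alone, so with thousands of unstructured variables one should expect many spurious hits.

```r
set.seed(42)
n <- 200   # observations (arbitrary)
p <- 100   # predictors, none of which is related to the response

x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)                       # response generated independently of x

fit <- lm(y ~ x)

# p-values for the slopes (column 4 of the coefficient table, intercept dropped)
pvals <- summary(fit)$coefficients[-1, 4]

sum(pvals < 0.05)                   # number of noise predictors flagged as "significant"
```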

> However, the original poster did say "hundreds to thousands", which is
> smaller than "hundreds of thousands". When I try a regression problem with
> 3,000 coefficients in R running under Windows XP 64 bit with 8Gb of memory
> on the machine and the /3Gb option active (i.e., R can get up to 3Gb), R
> 2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
>
> R version 2.6.1 (2007-11-26)
> Copyright (C) 2007 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
> > m <- 3000
> > n <- m * 10
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
>               dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > dim(x)
> [1] 30000  3000
> > k <- sample(m, 10)
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> Error: cannot allocate vector of size 686.6 Mb
> > object.size(x)/2^20
> [1] 687.7787
> > memory.size()
> [1] -2022.552
> >
> and the Windows process monitor shows the peak memory usage for Rgui.exe at
> 2,137,923K. But in a 64 bit version of R, I would be surprised if it was
> not possible to run this (given sufficient memory).
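
The size in the error message above can be checked by hand: a 30000 x 3000 matrix of doubles takes 8 bytes per entry, so each additional copy costs about 686.6 Mb. The sketch below does that arithmetic and then, on a deliberately small problem, compares `lm.fit()` with a solve of the normal equations, which only needs the m x m cross-product rather than another copy of the model matrix. The small sizes and the names `n_small`, `m_small`, `xs`, and `ys` are assumptions for illustration, and the normal equations are numerically less stable than the QR decomposition that `lm.fit()` uses, so this is a memory-saving alternative to consider, not a drop-in replacement.

```r
# Memory for one dense copy of the model matrix, in Mb
n <- 30000
m <- 3000
dense_mb <- n * m * 8 / 2^20        # 8 bytes per double
dense_mb                            # about 686.6, matching the error message

# The m x m cross-product X'X is far smaller
xtx_mb <- m * m * 8 / 2^20
xtx_mb

# Small-scale check that the normal equations reproduce lm.fit()
set.seed(1)
n_small <- 1000
m_small <- 20
xs <- matrix(rnorm(n_small * m_small), nrow = n_small)
ys <- rowSums(xs[, 1:5]) + rnorm(n_small)

b_qr <- unname(lm.fit(x = xs, y = ys)$coefficients)
b_ne <- as.vector(solve(crossprod(xs), crossprod(xs, ys)))

all.equal(b_qr, b_ne)
```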
>
> However, R easily handles a slightly smaller problem:
> > m <- 1000 # of variables
> > n <- m * 10 # of rows
> > k <- sample(m, 10)
> > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
>               dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
> > y <- rowSums(x[,k]) + 10 * rnorm(n)
> > fit <- lm.fit(y=y, x=x)
> > # distribution of coefs that should be one vs zero
> > round(rbind(one=quantile(fit$coefficients[k]),
>               zero=quantile(fit$coefficients[-k])), digits=2)
>         0%   25%   50%  75% 100%
> one   0.94  0.98  1.04 1.10 1.18
> zero -0.30 -0.08 -0.01 0.06 0.29
> >
>
> To echo Bert Gunter's cautions, one must be careful doing ordinary linear
> regression with large numbers of coefficients. It does seem a little
> unlikely that there is sufficient data to get useful estimates of three
> thousand coefficients using linear regression in data managed in Excel
> (though I guess it could be possible using Excel 12.0, which can handle up
> to 1 million rows - recent versions prior to 2008 could handle only 64K rows
> - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ). So, the
> suggestion to consult a local statistician is good advice - there may be
> other more suitable approaches, and if some form of linear regression is an
> appropriate approach, there are things to do to gain confidence that the
> results of the linear regression convey useful information.
>
> -- Tony Plate
>
> > -- Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On
> > Behalf Of Michelle Chu
> > Sent: Tuesday, February 05, 2008 9:00 AM
> > To: R-help_at_r-project.org
> > Subject: [R] Maximum number of variables allowed in a multiple
> > linear regression model
> >
> > Hi,
> >
> > I appreciate it if someone can confirm the maximum number of variables
> > allowed in a multiple linear regression model. Currently, I am looking for
> > software with the capacity of handling approximately 3,000 variables. I
> > am using Excel to process the results. Any information for processing a
> > matrix from Excel with hundreds to thousands of variables will be helpful.
> >
> > Best Regards,
> > Michelle

