Re: [R] Large number of dummy variables

From: Douglas Bates <bates_at_stat.wisc.edu>
Date: Mon, 21 Jul 2008 18:07:26 -0500

On Mon, Jul 21, 2008 at 5:45 PM, Bert Gunter <gunter.berton_at_gene.com> wrote:
> Unless I'm way off base, dummy variables are never needed (nor desirable)
> in R; they should be modelled as factors instead. AN INTRO TO R might, and
> certainly V&R's MASS and others will, explain this in more detail.

But Alan wants to include those factors in a linear regression model. If you use lm(), it will create a dense model matrix from those factors, and that's when you run out of memory.

Alan: I haven't read the whole discussion yet but if you really, really want to use a fixed-effects model with factors that have that many levels then you can form (the transpose of) the sparse model matrix for just those factors using code like

library(Matrix)
## each coercion gives the transposed indicator matrix (levels x observations);
## rBind() stacks them row-wise (plain rbind() works in recent Matrix versions)
MMt <- rBind(as(fac1, "sparseMatrix"), as(fac2, "sparseMatrix")[-1, ])

At that point you may be able to use

solve(tcrossprod(MMt), MMt %*% y)

to solve for the coefficients. Notice that I have dropped the indicator row for the first level of the second factor but kept all the indicator rows for the first factor. The coefficients therefore correspond to an lm specification of

lm(y ~ 0 + fac1 + fac2, ...)

under the default contrasts.
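A toy check of the approach above (the data here are made up, and I use plain rbind(), which replaces rBind() in recent versions of Matrix): build the transposed sparse indicator matrices, solve the normal equations in sparse arithmetic, and compare against the dense lm() fit.

```r
library(Matrix)
set.seed(42)
n    <- 50
fac1 <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
fac2 <- factor(sample(c("X", "Y", "Z"), n, replace = TRUE))
y    <- rnorm(n)

## transposed sparse indicator matrices, dropping fac2's first level
MMt  <- rbind(as(fac1, "sparseMatrix"), as(fac2, "sparseMatrix")[-1, ])

## normal equations (M'M) beta = M'y, all in sparse arithmetic
beta <- solve(tcrossprod(MMt), MMt %*% y)

## should agree with the dense fit under the default contrasts
cbind(sparse = as.numeric(as.matrix(beta)),
      dense  = unname(coef(lm(y ~ 0 + fac1 + fac2))))
```

The coefficient order matches lm(y ~ 0 + fac1 + fac2): all levels of fac1, then all but the first level of fac2.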

I'm not sure that is the best way of solving for the coefficients; I would need to look at the code for that solve method to see what form of factorization it uses. Also, I agree with Harold that you really should consider using random effects for those factors. It is almost never a good idea to try to estimate fixed effects for thousands of levels of a factor.
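The random-effects alternative looks like this in lme4 (a sketch only; the data frame and its simulated contents are made up for illustration). Crossed grouping factors with many levels are exactly what lmer handles cheaply, since it never forms dense indicator columns for them.

```r
library(lme4)
set.seed(1)
## simulated data: two crossed grouping factors with a random effect each
dat <- data.frame(fac1 = factor(sample(1:20, 500, replace = TRUE)),
                  fac2 = factor(sample(1:20, 500, replace = TRUE)))
dat$y <- rnorm(20)[dat$fac1] + rnorm(20)[dat$fac2] + rnorm(500)

## crossed random intercepts for both factors
fm <- lmer(y ~ 1 + (1 | fac1) + (1 | fac2), data = dat)
summary(fm)
```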

> -- Bert Gunter
> Genentech, Inc.
>
> -----Original Message-----
> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On
> Behalf Of Doran, Harold
> Sent: Monday, July 21, 2008 3:16 PM
> To: aspearot_at_ucsc.edu; r-help_at_r-project.org
> Cc: Douglas Bates
> Subject: Re: [R] Large number of dummy variables
>
> Well, at the risk of entering a debate I really don't have time for (I'm
> doing it anyway) why not consider a random coefficient model? If your
> response has anything like, "well, random effects and fixed effects are
> correlated and so the estimates are biased but OLS is consistent and
> unbiased via an appeal to Gauss-Markov" then I will probably make time
> for this discussion :)
>
> I have experienced this problem, though. In what you're doing, you are
> first creating the model matrix and then doing the demeaning, correct? I
> do recall Doug Bates was, at one point, doing some work where the model
> matrix for the fixed effects was immediately created as a sparse matrix
> for OLS models. I think doing the work on the sparse matrix is a better
> analytical method than time-demeaning. I don't remember where that work
> is, though.
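That idea is available in the Matrix package as sparse.model.matrix(), which builds the fixed-effects design directly in sparse form (a sketch with invented data, not Harold's or anyone else's actual code):

```r
library(Matrix)
set.seed(7)
dat <- data.frame(g1 = factor(sample(1:30, 1000, replace = TRUE)),
                  g2 = factor(sample(1:30, 1000, replace = TRUE)),
                  y  = rnorm(1000))

## the design matrix is a dgCMatrix and is never expanded densely
X    <- sparse.model.matrix(~ g1 + g2, data = dat)
beta <- solve(crossprod(X), crossprod(X, dat$y))   # sparse normal equations
```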
>
> There is a package called SparseM which has functions for doing OLS with
> sparse matrices. I don't know its status, but I vaguely recall the author
> of SparseM at one point noting that the work of Bates and Maechler would
> be the go-to package for work with large, sparse model matrices.
>
>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org
>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Alan Spearot
>> Sent: Monday, July 21, 2008 5:59 PM
>> To: r-help_at_r-project.org
>> Subject: [R] Large number of dummy variables
>>
>> Hello,
>>
>> I'm trying to run a regression predicting trade flows between
>> importers and exporters. I wish to include both
>> year-importer dummies and year-exporter dummies. The former
>> includes 1378 levels, and the latter includes 1390 levels. I
>> have roughly 100,000 total observations.
>>
>> When I use lm() to run a simple regression, it gives me a
>> "cannot allocate ___" error. I've been able to get around
>> this by time-demeaning over one large group, but since I have
>> two, it doesn't work correctly. Is there a more efficient
>> way to handle a model matrix this large in R?
>>
>> Thanks for your help.
>>
>> Alan Spearot
>>
>> --
>> Alan Spearot
>> Assistant Professor - International Economics University of
>> California - Santa Cruz
>> 1156 High Street
>> 453 Engineering 2
>> Santa Cruz, CA 95064
>> Office: (831) 459-1530
>> acspearot_at_gmail.com
>> http://people.ucsc.edu/~aspearot
>>
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>



Received on Mon 21 Jul 2008 - 23:11:56 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 21 Jul 2008 - 23:31:56 GMT.
