Re: [R] Large number of dummy variables

From: Martin Maechler <>
Date: Tue, 22 Jul 2008 16:07:14 +0200

>>>>> "HaroldD" == Doran, Harold <>
>>>>>     on Mon, 21 Jul 2008 19:15:37 -0400 writes:

    HaroldD> Well, yes and no. In R there really isn't a need to create the model matrix because this is done in R from the factors. But to implement the computational trick Alan is asking about, he must first create the full, dense model matrix and then do the time-demeaning on that matrix.

    HaroldD> If lm() could go straight from a factor to a sparse
    HaroldD> model matrix, time-demeaning would not be necessary.
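
For readers unfamiliar with the trick, here is a minimal sketch of time-demeaning (the "within" transformation) for a single grouping factor. The data and the names g, x and y are made up for illustration; this is not Alan's data or code. Subtracting group means from both the response and the regressor gives the same slope as fitting the full set of dummies, without ever forming the dummy columns.

```r
set.seed(42)
n <- 200
g <- factor(sample(1:10, n, replace = TRUE))   # hypothetical grouping factor
x <- rnorm(n)
y <- 2 * x + as.numeric(g) + rnorm(n)

# Within transformation: subtract the group mean from each observation
x_dm <- x - ave(x, g)
y_dm <- y - ave(y, g)

# Slope from the demeaned regression (no intercept, no dummies)
b_within <- coef(lm(y_dm ~ 0 + x_dm))[["x_dm"]]

# Slope from the regression that includes one dummy per level of g
b_dummy <- coef(lm(y ~ x + g))[["x"]]

print(all.equal(b_within, b_dummy))
```

With two crossed factors, as in Alan's problem, a single pass of demeaning over each factor does not in general reproduce the two-way dummy fit when the design is unbalanced, which is why the sparse-matrix route is attractive.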

Well, lm() is in "stats" and would only work with dense matrices anyway.
But you are right in what you *meant*: We'd need versions of model.frame() and model.matrix() which from a formula produce a sparse model matrix (aka "X matrix") or its transpose.
Doug Bates showed you how to do the latter manually, equivalently to model.matrix(~ 0 + f1 + f2) when f1 and f2 are factors.
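
A minimal sketch of such a manual construction, assuming a reasonably recent version of the 'Matrix' package that provides fac2sparse(); the factors f1 and f2 below are simulated placeholders, and this is not necessarily the exact code Doug posted. fac2sparse() returns the transposed indicator matrix (levels in rows, observations in columns), so the transposes are stacked and then transposed back. The first level of f2 is dropped to mirror what model.matrix(~ 0 + f1 + f2) does under the default treatment contrasts.

```r
library(Matrix)

set.seed(1)
n  <- 1000
f1 <- factor(sample(letters[1:5], n, replace = TRUE))
f2 <- factor(sample(LETTERS[1:4], n, replace = TRUE))

# Transposed sparse indicator matrices: one row per level, one column per observation
Xt1 <- fac2sparse(f1)
Xt2 <- fac2sparse(f2)

# Keep all levels of f1, drop the first level of f2, then transpose to n x p
Xt <- rbind(Xt1, Xt2[-1, , drop = FALSE])
X  <- t(Xt)

# Dense reference, only feasible here because the example is small
Xd <- model.matrix(~ 0 + f1 + f2)

print(dim(X))
print(max(abs(as.matrix(X) - Xd)))
```

Each row of X holds only two nonzero entries at most, so storage grows with the number of observations rather than with observations times levels.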

I'm sure that longer-term we'd want versions of model.matrix() / model.frame() that work with sparse matrices.

    HaroldD> Doing work as Doug suggests in the other
    HaroldD> post is what would be best for now, me thinks.

BTW, you mentioned SparseM's "OLS with sparse matrices". The problem there is the same as with 'Matrix': You must somehow get your sparse X matrix, and the best current tools to do that, AFAIK, are the ones in 'Matrix' that Doug Bates mentioned (and wrote!).
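
Once a sparse X is available, the least-squares fit itself can be sketched with 'Matrix' alone by solving the normal equations. This is an illustrative sketch on simulated data, assuming fac2sparse() is available and that X has full column rank; the level names and sizes are invented, and it is not SparseM's interface.

```r
library(Matrix)

set.seed(2)
n  <- 2000
f1 <- factor(sample(sprintf("imp%02d", 1:30), n, replace = TRUE))
f2 <- factor(sample(sprintf("exp%02d", 1:25), n, replace = TRUE))
y  <- rnorm(n)

# Sparse n x p design: all levels of f1, all but the first level of f2
X <- t(rbind(fac2sparse(f1), fac2sparse(f2)[-1, , drop = FALSE]))

# Normal equations X'X b = X'y; X'X is a sparse symmetric matrix
XtX  <- crossprod(X)
Xty  <- crossprod(X, y)
beta <- drop(as.matrix(solve(XtX, Xty)))

# Dense reference fit, feasible only because this example is small
beta_lm <- unname(coef(lm(y ~ 0 + f1 + f2)))

print(all.equal(unname(beta), beta_lm, tolerance = 1e-6))
```

The dense lm() call is included only as a check; for a problem of Alan's size one would skip it and work with the sparse objects throughout.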

Martin Maechler

    HaroldD> -----Original Message-----
    HaroldD> From: Bert Gunter []
    HaroldD> Sent: Mon 7/21/2008 6:45 PM
    HaroldD> To: Doran, Harold;;
    HaroldD> Subject: RE: [R] Large number of dummy variables
    HaroldD> Unless I'm way off base, dummy variables are never needed (nor are they desirable)
    HaroldD> in R; they should be modelled as factors instead. AN INTRO TO R might, and
    HaroldD> certainly V&R's MASS and others will, explain this in more detail.

    HaroldD> -- Bert Gunter
    HaroldD> Genentech, Inc.

    HaroldD> -----Original Message-----
    HaroldD> From: [] On
    HaroldD> Behalf Of Doran, Harold
    HaroldD> Sent: Monday, July 21, 2008 3:16 PM
    HaroldD> To:;
    HaroldD> Cc: Douglas Bates
    HaroldD> Subject: Re: [R] Large number of dummy variables

    HaroldD> Well, at the risk of entering a debate I really don't have time for (I'm
    HaroldD> doing it anyway) why not consider a random coefficient model? If your
    HaroldD> response has anything like, "well, random effects and fixed effects are
    HaroldD> correlated and so the estimates are biased but OLS is consistent and
    HaroldD> unbiased via an appeal to Gauss-Markov" then I will probably make time
    HaroldD> for this discussion :)
    HaroldD> I have experienced this problem, though. In what you're doing, you are
    HaroldD> first creating the model matrix and then doing the demeaning, correct? I
    HaroldD> do recall Doug Bates was, at one point, doing some work where the model
    HaroldD> matrix for the fixed effects was immediately created as a sparse matrix
    HaroldD> for OLS models. I think doing the work on the sparse matrix is a better
    HaroldD> analytical method than time-demeaning. I don't remember where that work
    HaroldD> is, though. 

    HaroldD> There is a package called SparseM which had functions for doing OLS with
    HaroldD> sparse matrices. I don't know its status, but vaguely recall the author
    HaroldD> of SparseM at one point noting that the work of Bates and Maechler would
    HaroldD> be the go-to package for work with large, sparse model matrices.

>> -----Original Message-----
>> From:
>> [] On Behalf Of Alan Spearot
>> Sent: Monday, July 21, 2008 5:59 PM
>> To:
>> Subject: [R] Large number of dummy variables
>> Hello,
>> I'm trying to run a regression predicting trade flows between
>> importers and exporters. I wish to include both
>> year-importer dummies and year-exporter dummies. The former
>> includes 1378 levels, and the latter includes 1390 levels. I
>> have roughly 100,000 total observations.
>> When I'm using lm() to run a simple regression, it gives me a
>> "cannot allocate ___" error. I've been able to get around
>> time-demeaning over one large group, but since I have two, it
>> doesn't work in the correct way. Is there a more efficient
>> way to handle a model matrix this large in R?
>> Thanks for your help.
>> Alan Spearot
>> --
>> Alan Spearot
>> Assistant Professor - International Economics University of
>> California - Santa Cruz
>> 1156 High Street
>> 453 Engineering 2
>> Santa Cruz, CA 95064
>> Office: (831) 459-1530
Received on Tue 22 Jul 2008 - 14:42:27 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 Jul 2008 - 15:01:57 GMT.
