Re: [R] converting factors to dummy variables

From: Charles C. Berry <cberry_at_tajo.ucsd.edu>
Date: Tue, 4 Dec 2007 20:26:39 -0800

On Wed, 5 Dec 2007, Tim Calkins wrote:

> Hi all -
>
> I'm trying to find a way to create dummy variables from factors in a
> regression. I have been using biglm along the lines of
>
> ff <- log(Price) ~ factor(Colour):factor(Store) +
> factor(DummyVar):factor(Colour):factor(Store)
>
> lm1 <- biglm(ff, data=my.dataset)
>
> but because there are lots of colours (>100) and lots of stores
> (>250), I run into memory problems. Now, not every store sells every
> colour and so it should be possible to create the matrix of factor
> variables myself and greatly reduce the size of the problem. It seems
> that lm / biglm use all combinations of factor levels when used in
> factor(Colour):factor(Store), so by creating my own matrix of factor
> variables I should be able to reduce the size of the problem
> considerably.
>
> If I have a data frame
>> my.dataset <- data.frame(Price=1:12, Colour= c('red','blue','green'),
> Store=c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
> DummyVar = sort(rep(c(0,1),6)) )
>
> I want to create a data frame with the dummy vars that looks like
>
> red:a red:e blue:b blue:c blue:e green:c green:d green:e
>     1     0      0      0      0       0       0       0
>     0     0      1      0      0       0       0       0
>     0     0      0      0      0       1       0       0
>     1     0      0      0      0       0       0       0
>     0     0      0      1      0       0       0       0
>     0     0      0      0      0       0       1       0
>     0     1      0      0      0       0       0       0
>     0     0      0      0      1       0       0       0
>     0     0      0      0      0       0       0       1
>     0     1      0      0      0       0       0       0
>     0     0      1      0      0       0       0       0
>     0     0      0      0      0       0       0       1
>
> any ideas would be appreciated.

Use

mat <- model.matrix( ~ ClrStr - 1,
 	transform( my.dataset, ClrStr =
 		factor( paste(Colour, Store, sep=":") ) ) )

then pretty up the colnames() and re-order columns if order matters.
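
A minimal sketch of that tidy-up, using the example data frame from the question. The `sub()` pattern and the `wanted` vector are illustrative choices, not part of the original suggestion; `wanted` simply lists the columns in the order shown in the question.

```r
# Example data from the question (Colour is recycled to 12 rows).
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = c('red', 'blue', 'green'),
  Store    = c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
  DummyVar = sort(rep(c(0, 1), 6))
)

# One indicator column per Colour:Store pair that actually occurs.
mat <- model.matrix(~ ClrStr - 1,
                    transform(my.dataset,
                              ClrStr = factor(paste(Colour, Store, sep = ":"))))

# model.matrix() prefixes each column with the variable name; strip it.
colnames(mat) <- sub("^ClrStr", "", colnames(mat))

# Factor levels sort alphabetically, so reorder only if the order matters.
# 'wanted' is a hypothetical ordering taken from the question's table.
wanted <- c("red:a", "red:e", "blue:b", "blue:c", "blue:e",
            "green:c", "green:d", "green:e")
mat <- mat[, wanted]
```

Only the 8 observed combinations get a column here, rather than one column for every possible Colour by Store pair.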


However, if DummyVar is a categorical variable, you could just compute means on the appropriate subsets by maintaining a table of sums and totals. Then in a second pass through the data get the residual sums of squares. If the data are already in a database, it might make sense to do these operations there and import the results to R for further massaging.
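
A rough sketch of the two-pass idea, done in memory on the example data. It assumes DummyVar is categorical, so that every Colour/Store/DummyVar cell gets its own mean, and it reads "sums and totals" as per-cell sums and counts. The names `cell`, `sums`, `counts` and `rss` are made up for illustration; with data too large for memory, the same tables would be accumulated chunk by chunk.

```r
my.dataset <- data.frame(
  Price    = 1:12,
  Colour   = c('red', 'blue', 'green'),
  Store    = c('a', 'b', 'c', 'a', 'c', 'd', 'e', 'e', 'e', 'e', 'b', 'e'),
  DummyVar = sort(rep(c(0, 1), 6))
)

y <- log(my.dataset$Price)

# One label per observed Colour/Store/DummyVar combination;
# drop = TRUE discards combinations that never occur.
cell <- with(my.dataset, interaction(Colour, Store, DummyVar, drop = TRUE))

# First pass: table of per-cell sums and counts, giving the cell means.
sums   <- tapply(y, cell, sum)
counts <- tapply(y, cell, length)
means  <- sums / counts

# Second pass: residual sum of squares around each observation's cell mean.
rss <- sum((y - means[as.character(cell)])^2)
```

On data this small the result can be checked against `deviance(lm(y ~ cell))`, which fits the same cell means.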

HTH, Chuck

>
>
> --
> Tim Calkins
> 0406 753 997
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry_at_tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

Received on Wed 05 Dec 2007 - 04:31:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 05 Dec 2007 - 05:30:17 GMT.
