From: Andrew Robinson <A.Robinson_at_ms.unimelb.edu.au>

Date: Tue, 03 May 2011 11:27:38 +1000

On Mon, May 02, 2011 at 05:22:57PM -0400, Clemontina Alexander wrote:

> Thanks for your response, but I guess I didn't make my question clear.
> I am already familiar with the concept of dummy variables and
> regression in R. My question is, can the "lars" package (or some other
> lasso algorithm) handle factors? I did use dummy variables in my
> original data, but lars (lasso) only shrank the coefficients of some
> of the levels of one factor to 0. Is this the correct thing to do?

Yes, it's the correct thing to do. So far as the linear model is concerned, factors are just a convenience to help us handle the dummy variables, so lars penalizes each dummy column separately and can shrink some levels of a factor to 0 while keeping others. It sounds to me as though you are after a shrinkage device that will treat the factor as a whole.
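To make that concrete, here is a minimal sketch of how the dummy columns could be built and handed to lars. The factor and numeric values are transcribed from the str() and model.matrix() output quoted further down; Y is simulated, and the names Xmat and grp are only illustrative. The "assign" attribute records which term each column came from, which is the mapping a method that penalizes a factor as a whole (a group-wise penalty) would need; lars itself ignores it.

```r
# Example data, transcribed from the output quoted later in this thread
X1 <- factor(c("D", "A", "C", "A", "B", "B", "A", "B", "D", "B"))
X2 <- factor(c("G", "H", "G", "F", "I", "I", "I", "E", "I", "G"))
X3 <- c(51, 46, 50, 44, 43, 50, 30, 42, 49, 48)
X4 <- c(2.8640884, 1.5462243, 1.9430901, 2.4504180, 2.7535052,
        1.6200326, 0.5750533, 5.9224777, 2.0401528, 6.2995288)
X  <- data.frame(X1, X2, X3, X4)

# Simulated response (an assumption; the original Y is not shown in full)
set.seed(1)
Y <- rnorm(10, 0, 1)

# Expand the factors into treatment-contrast dummy columns
mm <- model.matrix(~ X1 + X2 + X3 + X4, data = X)

# Column-to-term mapping, with the intercept entry dropped
grp  <- attr(mm, "assign")[-1]
# Drop the intercept column; lars fits its own intercept by default
Xmat <- mm[, -1, drop = FALSE]

# Which dummy columns belong to which original predictor
print(split(colnames(Xmat), grp))

# Fit the lasso path only if the lars package is installed
if (requireNamespace("lars", quietly = TRUE)) {
  fit <- lars::lars(x = Xmat, y = Y, type = "lasso")
  print(coef(fit))  # one row per step; each dummy column enters on its own
}
```

Each row of the coefficient matrix is one step along the path, so you can see individual levels of X1 or X2 entering or leaving independently of their siblings.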

> Because intuitively it seems like I would want to shrink the whole
> factor coefficient to 0. If this is correct, what is the
> interpretation? For example, for X1, if lasso drops the coefficient
> for levels A and B, but not C and D, does this mean that X1 should be
> included in the model?

It means that X1 should be recoded to three levels: C, D, and the rest (A and B combined).
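A minimal sketch of that recoding in base R, assuming the X1 values from the example quoted below; the names X1r and "other" are only illustrative:

```r
# X1 as in the example quoted below (levels A, B, C, D)
X1 <- factor(c("D", "A", "C", "A", "B", "B", "A", "B", "D", "B"))

# Assigning the same name to several levels merges them
X1r <- X1
levels(X1r)[levels(X1r) %in% c("A", "B")] <- "other"

# Make the merged level the baseline for treatment contrasts
X1r <- relevel(X1r, ref = "other")

# Cross-tabulate old against new coding as a sanity check
print(table(X1, X1r))
```

The recoded factor then yields two dummy columns (C and D, each against the rest) instead of three.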

Cheers

Andrew

> Thanks.
>
> On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:
> >
> > On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
> >
> >> Hi,
> >>
> >> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2_at_ncsu.edu>
> >> wrote:
> >>>
> >>> Hi! This is my first time posting. I've read the general rules and
> >>> guidelines, but please bear with me if I make some fatal error in
> >>> posting. Anyway, I have a continuous response and 29 predictors made
> >>> up of continuous variables and nominal and ordinal categorical
> >>> variables. I'd like to do lasso on these, but I get an error. The way
> >>> I am using "lars" doesn't allow for the factors. Is there a special
> >>> option or some other method in order to do lasso with cat. variables?
> >>>
> >>> Here is an example (considering ordinal variables as just nominal):
> >>>
> >>> set.seed(1)
> >>> Y <- rnorm(10,0,1)
> >>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
> >>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
> >>> X3 <- sample(x=30:55, size=10, replace=TRUE) # think age
> >>> X4 <- rchisq(10, df=4, ncp=0)
> >>> X <- data.frame(X1,X2,X3,X4)
> >>>
> >>> > str(X)
> >>>
> >>> 'data.frame': 10 obs. of 4 variables:
> >>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
> >>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
> >>>  $ X3: int 51 46 50 44 43 50 30 42 49 48
> >>>  $ X4: num 2.86 1.55 1.94 2.45 2.75 ...
> >>>
> >>>
> >>> I'd like to do:
> >>> obj <- lars(x=X, y=Y, type = "lasso")
> >>>
> >>> Instead, what I have been doing is converting all data to continuous
> >>> but I think this is really bad!
> >>
> >> Yeah, it is.
> >>
> >> Check out the "Categorical Predictor Variables" section here for a way
> >> to handle such predictor vars:
> >> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
> >
> > Steve's citation is somewhat helpful, but not sufficient to take the next
> > steps. You can find details regarding the mechanics of typical linear
> > regression in R on the ?lm page where you find that the factor variables are
> > typically handled by model.matrix. See below:
> >
> > > model.matrix(~X1 + X2 + X3 + X4, X)
> >    (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
> > 1            1   0   0   1   0   1   0   0 51 2.8640884
> > 2            1   0   0   0   0   0   1   0 46 1.5462243
> > 3            1   0   1   0   0   1   0   0 50 1.9430901
> > 4            1   0   0   0   1   0   0   0 44 2.4504180
> > 5            1   1   0   0   0   0   0   1 43 2.7535052
> > 6            1   1   0   0   0   0   0   1 50 1.6200326
> > 7            1   0   0   0   0   0   0   1 30 0.5750533
> > 8            1   1   0   0   0   0   0   0 42 5.9224777
> > 9            1   0   0   1   0   0   0   1 49 2.0401528
> > 10           1   1   0   0   0   1   0   0 48 6.2995288
> > attr(,"assign")
> > [1] 0 1 1 1 2 2 2 2 3 4
> > attr(,"contrasts")
> > attr(,"contrasts")$X1
> > [1] "contr.treatment"
> >
> > attr(,"contrasts")$X2
> > [1] "contr.treatment"
> >
> > The numeric variables are passed through, while the dummy variables for
> > factor columns are constructed (as treatment contrasts) and the whole thing
> > is returned in a neat package.
> >
> > --
> > David.
> >>
> >> HTH,
> >> -steve
> >>
> > --
> > David Winsemius, MD
> > Heritage Laboratories
> > West Hartford, CT
> >
> >
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Andrew Robinson
Program Manager, ACERA
Department of Mathematics and Statistics
University of Melbourne, VIC 3010 Australia
Tel: +61-3-8344-6410 (prefer email)
Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://www.acera.unimelb.edu.au/

Forest Analytics with R (Springer, 2011)
http://www.ms.unimelb.edu.au/FAwR/
Introduction to Scientific Programming and Simulation using R (CRC, 2009):
http://www.ms.unimelb.edu.au/spuRs/

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 05 May 2011 - 06:25:05 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
