From: Breheny, Patrick <patrick.breheny_at_uky.edu>

Date: Mon, 02 May 2011 22:26:12 -0400

Subject: Re: [R] Lasso with Categorical Variables

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Thu 05 May 2011 - 06:25:05 GMT

Clemontina,

It sounds like you are looking for the group lasso (Yuan & Lin, 2006). Two packages on CRAN implement this idea: grpreg and grplasso. The syntax of each is similar to lars (in particular, both require a numeric design matrix such as the one produced by model.matrix), except that you must also supply a vector that describes the grouping of the columns (e.g., c(1,1,1,2,2,3,3,...)). The coefficients within each group will then be either all zero or all nonzero (i.e., variable selection occurs at the group level).
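As a rough sketch of the approach described above, the grouping vector can be taken from the "assign" attribute that model.matrix attaches to its result. The data below imitate the simulated example quoted later in this thread, but the sample size of 50 and the object names (mm, Xmat, group, fit) are assumptions made for illustration. The grpreg call is guarded so the rest runs even if the package is not installed, and penalty = "grLasso" is grpreg's name for the group lasso penalty.

```r
# Simulated data in the spirit of the example quoted in this thread
# (n = 50 is an assumption; the original post used 10 rows)
set.seed(1)
n <- 50
Y <- rnorm(n)
X <- data.frame(
  X1 = factor(sample(LETTERS[1:4],  n, replace = TRUE)),
  X2 = factor(sample(LETTERS[5:10], n, replace = TRUE)),
  X3 = sample(30:55, n, replace = TRUE),   # think age
  X4 = rchisq(n, df = 4)
)

# Numeric design matrix: factors are expanded into dummy columns
mm <- model.matrix(~ X1 + X2 + X3 + X4, data = X)

# "assign" maps every column to its model term (0 = intercept), so with
# the intercept entry dropped it is the grouping vector directly
group <- attr(mm, "assign")[-1]
Xmat  <- mm[, -1, drop = FALSE]   # drop the intercept column to match

if (requireNamespace("grpreg", quietly = TRUE)) {
  fit <- grpreg::grpreg(Xmat, Y, group = group, penalty = "grLasso")
  # Coefficients at the smallest lambda on the fitted path; within a
  # factor's group they are either all zero or all nonzero
  print(fit$beta[, ncol(fit$beta)])
}
```

With this setup the dummy columns of X1 share one group label and those of X2 another, while X3 and X4 each form a group of size one, so they are selected individually as in the ordinary lasso.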

Patrick Breheny

Assistant Professor

Department of Biostatistics

Department of Statistics

University of Kentucky

From: r-help-bounces_at_r-project.org [r-help-bounces_at_r-project.org] On Behalf Of Clemontina Alexander [ckalexa2_at_ncsu.edu] Sent: Monday, May 02, 2011 5:22 PM

To: David Winsemius

Cc: r-help_at_r-project.org

Subject: Re: [R] Lasso with Categorical Variables

Thanks for your response, but I guess I didn't make my question clear.
I am already familiar with the concept of dummy variables and
regression in R. My question is, can the "lars" package (or some other
lasso algorithm) handle factors? I did use dummy variables in my
original data, but lars (lasso) only shrank the coefficients of some
of the levels of one factor to 0. Is this the correct thing to do?
Because intuitively it seems like I would want to shrink the whole
factor coefficient to 0. If this is correct, what is the
interpretation? For example, for X1, if lasso drops the coefficient
for levels A and B, but not C and D, does this mean that X1 should be
included in the model?

Thanks.

On Mon, May 2, 2011 at 2:47 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:

> On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
>
>> Hi,
>>
>> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckalexa2_at_ncsu.edu>
>> wrote:
>>>
>>> Hi! This is my first time posting. I've read the general rules and
>>> guidelines, but please bear with me if I make some fatal error in
>>> posting. Anyway, I have a continuous response and 29 predictors made
>>> up of continuous variables and nominal and ordinal categorical
>>> variables. I'd like to do lasso on these, but I get an error. The way
>>> I am using "lars" doesn't allow for the factors. Is there a special
>>> option or some other method in order to do lasso with cat. variables?
>>>
>>> Here is an example (considering ordinal variables as just nominal):
>>>
>>> set.seed(1)
>>> Y <- rnorm(10,0,1)
>>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
>>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
>>> X3 <- sample(x=30:55, size=10, replace=TRUE) # think age
>>> X4 <- rchisq(10, df=4, ncp=0)
>>> X <- data.frame(X1,X2,X3,X4)
>>>
>>> > str(X)
>>>
>>> 'data.frame':   10 obs. of  4 variables:
>>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>>>  $ X3: int 51 46 50 44 43 50 30 42 49 48
>>>  $ X4: num 2.86 1.55 1.94 2.45 2.75 ...
>>>
>>>
>>> I'd like to do:
>>> obj <- lars(x=X, y=Y, type = "lasso")
>>>
>>> Instead, what I have been doing is converting all data to continuous
>>> but I think this is really bad!
>>
>> Yeah, it is.
>>
>> Check out the "Categorical Predictor Variables" section here for a way
>> to handle such predictor vars:
>> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
>
> Steve's citation is somewhat helpful, but not sufficient to take the next
> steps. You can find details regarding the mechanics of typical linear
> regression in R on the ?lm page where you find that the factor variables are
> typically handled by model.matrix. See below:
>
> > model.matrix(~X1 + X2 + X3 + X4, X)
>    (Intercept) X1B X1C X1D X2F X2G X2H X2I X3        X4
> 1            1   0   0   1   0   1   0   0 51 2.8640884
> 2            1   0   0   0   0   0   1   0 46 1.5462243
> 3            1   0   1   0   0   1   0   0 50 1.9430901
> 4            1   0   0   0   1   0   0   0 44 2.4504180
> 5            1   1   0   0   0   0   0   1 43 2.7535052
> 6            1   1   0   0   0   0   0   1 50 1.6200326
> 7            1   0   0   0   0   0   0   1 30 0.5750533
> 8            1   1   0   0   0   0   0   0 42 5.9224777
> 9            1   0   0   1   0   0   0   1 49 2.0401528
> 10           1   1   0   0   0   1   0   0 48 6.2995288
> attr(,"assign")
> [1] 0 1 1 1 2 2 2 2 3 4
> attr(,"contrasts")
> attr(,"contrasts")$X1
> [1] "contr.treatment"
>
> attr(,"contrasts")$X2
> [1] "contr.treatment"
>
> The numeric variables are passed through, while the dummy variables for
> factor columns are constructed (as treatment contrasts) and the whole thing
> is returned in a neat package.
>
> --
> David.
>>
>> HTH,
>> -steve
>>
> --
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Thu 05 May 2011 - 07:00:05 GMT.
