Date: Sat, 01 Mar 2008 16:32:02 -0800 (PST)

Hi,

I am new to model selection by coefficient shrinkage methods such as the lasso, and I have become particularly interested in variable selection in Cox regression by the lasso. I learned that coxpath() in the R package glmpath fits the lasso for the Cox model. I tried the sample script on the help page of coxpath(), but I am having a difficult time understanding the output. I would greatly appreciate it if anyone could help me understand how to use the function.

> data(lung.data)
> attach(lung.data)
> fit.a <- coxpath(lung.data)
> print(fit.a)

Call:

coxpath(data = lung.data)

Step 1 : karno
Step 2 : celltype
Step 5 : trt
Step 6 : prior
Step 7 : age
Step 8 : diagtime

> summary(fit.a)

Call:

coxpath(data = lung.data)

        Df Log.p.lik       AIC       BIC
Step 1   0 -505.8840 1011.7679 1011.7679
Step 2   1 -486.0691  974.1382  977.0581
Step 5   2 -484.8520  973.7040  979.5440
Step 6   3 -483.4018  972.8036  981.5636
Step 7   4 -483.3801  974.7602  986.4401
Step 8   5 -483.2287  976.4573  991.0572
Step 9   6 -483.1112  978.2224  995.7423

First of all, why does the number of steps differ between the two outputs above? I confirmed with coxph() that the numbers (Log.p.lik, AIC, BIC) in the first row of summary(fit.a) come from a null Cox model, i.e. a model with no covariates. How, then, does Step 1 in the output of summary(fit.a) correspond to "Step 1" in the output of print(fit.a), where it seems to mean a model containing the variable "karno"?
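
As a side check on how the summary table might be read: the AIC and BIC columns look consistent with AIC = -2*Log.p.lik + 2*Df and BIC = -2*Log.p.lik + log(n)*Df. The sketch below recomputes both from the printed values. Note that n = 137 (the number of patients in lung.data) is an assumption, and the two formulas are inferred from the numbers rather than taken from the glmpath documentation.

```r
# Values copied from the summary(fit.a) table above.
df     <- 0:6
loglik <- c(-505.8840, -486.0691, -484.8520, -483.4018,
            -483.3801, -483.2287, -483.1112)
aic.printed <- c(1011.7679, 974.1382, 973.7040, 972.8036,
                 974.7602, 976.4573, 978.2224)
bic.printed <- c(1011.7679, 977.0581, 979.5440, 981.5636,
                 986.4401, 991.0572, 995.7423)

n <- 137  # assumption: number of observations in lung.data

# Assumed definitions, inferred from the printed numbers.
aic <- -2 * loglik + 2 * df
bic <- -2 * loglik + log(n) * df

# Largest absolute difference from the printed columns;
# anything at rounding level supports the assumed formulas.
max(abs(aic - aic.printed))
max(abs(bic - bic.printed))
```

If the differences are only at rounding level, the first row of the table would simply be the Df = 0 case, where AIC and BIC both reduce to -2*Log.p.lik.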

> predict(fit.a)

     trt celltype   karno  diagtime        age      prior
1 0.0000   0.0000  0.0000 0.000e+00  0.000e+00  0.000e+00
2 0.0000   0.0076 -0.0256 0.000e+00  0.000e+00  0.000e+00
5 0.0000   0.0450 -0.0286 0.000e+00  0.000e+00  0.000e+00
6 0.1428   0.1033 -0.0330 0.000e+00  0.000e+00 -4.326e-05
7 0.1468   0.1048 -0.0332 0.000e+00 -1.043e-07 -3.506e-04
8 0.1755   0.1139 -0.0340 5.642e-06 -1.404e-03 -2.367e-03

attr(,"s")

[1] 1 2 5 6 7 8

attr(,"fraction")

[1] 0.000 0.125 0.500 0.625 0.750 0.875

attr(,"mode")

[1] "step"

Second, comparing the output of print(fit.a) with that of predict(fit.a), I see some discrepancies. For example, "Step 1" of print(fit.a) lists the variable "karno", yet predict(fit.a) shows that the coefficient of "karno" is still 0 at step 1. The same goes for the variable "trt" at "Step 5". What do these discrepancies mean? I suspect I have misunderstood the whole idea of coefficient shrinkage in the first place, so I would appreciate it if anyone could shed some light on this.

I would also welcome any opinions on how I should do variable selection from this output. Should I rely on the table (Log.p.lik, AIC, BIC) from summary(fit.a), or should I rely on the coefficient table from predict(fit.a) and eliminate the variables whose coefficients are 0 at a certain step?

Thank you very much for your time.

