# [R] covariate selection in cox model (counting process)

From: Mayeul KAUFFMANN <mayeul.kauffmann_at_tiscali.fr>
Date: Thu 29 Jul 2004 - 03:08:35 EST

> If you can get the conditional independence (martingaleness) then, yes,
> BIC is fine.
>
> One way to check might be to see how similar the standard errors are
> with and without the cluster(id) term.

(Thank you "again !", Thomas.)

At first glance, the values seemed very similar (see below, case 2). However, to check this without being too subjective, and lacking a formal test, I needed reference values to judge the size of the differences: what counts as similar, and what does not?

CASE 1
I first estimated the model without modeling dependence:

```
Call:
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) + pop +
    pib + pib2 + crois + instab.x1 + instab.autres, data = xstep)
```

```                 coef exp(coef) se(coef) robust se     z       p
pop            0.3606     1.434   0.0978    0.1182  3.05 2.3e-03
pib           -0.5947     0.552   0.1952    0.1828 -3.25 1.1e-03
pib2          -0.4104     0.663   0.1452    0.1270 -3.23 1.2e-03
crois         -0.0592     0.943   0.0245    0.0240 -2.46 1.4e-02
instab.x1      2.2059     9.079   0.4692    0.4097  5.38 7.3e-08
instab.autres  0.9550     2.599   0.4700    0.4936  1.93 5.3e-02

```

Likelihood ratio test=74 on 6 df, p=6.2e-14 n= 7286

There seems to be a strong linear relationship between standard errors (se, or naive se) and robust se.

```
> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.96103    0.04064   23.65 2.52e-06 ***

Multiple R-Squared: 0.9911,     Adjusted R-squared: 0.9894
```
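As a cross-check, the through-origin slope and R^2 that `summary(lm(...))` reports can be reproduced directly from the two se columns of the table above: for a no-intercept fit, slope = Σxy/Σx² and R² = (Σxy)²/(Σx²·Σy²). A minimal sketch (Python used only to redo the arithmetic; the numbers are copied from the CASE 1 table):

```python
# Naive and robust standard errors copied from the CASE 1 table above.
naive  = [0.0978, 0.1952, 0.1452, 0.0245, 0.4692, 0.4700]
robust = [0.1182, 0.1828, 0.1270, 0.0240, 0.4097, 0.4936]

sxy = sum(x * y for x, y in zip(naive, robust))
sxx = sum(x * x for x in naive)
syy = sum(y * y for y in robust)

slope = sxy / sxx                # through-origin least-squares slope
r2 = sxy ** 2 / (sxx * syy)      # R^2 of a no-intercept regression

print(round(slope, 3), round(r2, 4))  # ~0.961 and ~0.9911, matching the lm() output
```

The same computation applied to the CASE 3 table below reproduces its reported slope of roughly 0.867 as well.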

CASE 2
Then I added a variable (pxcw) measuring the proximity of the previous event (0 < pxcw < 1).

n= 7286

```                 coef exp(coef) se(coef) robust se     z       p
pxcw           0.9063     2.475   0.4267    0.4349  2.08 3.7e-02
pop            0.3001     1.350   0.1041    0.1295  2.32 2.0e-02
pib           -0.5485     0.578   0.2014    0.1799 -3.05 2.3e-03
pib2          -0.4033     0.668   0.1450    0.1152 -3.50 4.6e-04
crois         -0.0541     0.947   0.0236    0.0227 -2.38 1.7e-02
instab.x1      1.9649     7.134   0.4839    0.4753  4.13 3.6e-05
instab.autres  0.8498     2.339   0.4693    0.4594  1.85 6.4e-02

```

Likelihood ratio test=78.3 on 7 df, p=3.04e-14 n= 7286

```
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.98397    0.02199   44.74 8.35e-09 ***

Multiple R-Squared: 0.997,      Adjusted R-squared: 0.9965
```

The naive standard errors (se) are closer to the robust se than they were when the dependence was not modeled: 0.98397 is very close to one, R^2 grew, etc. The dependence is strong (the risk is multiplied by 2.475 the day after an event), but conditional independence (given the covariates) seems hard to reject.
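For concreteness, the decay scheme described in the quoted message at the bottom (proximity equal to 1 at the event, halving every ten years) corresponds to pxcw(t) = 2^(-t/10), and the "risk multiplied by 2.475" reading is just exp(coef) evaluated at pxcw ≈ 1. A minimal sketch (the function name is mine, not from the original code):

```python
import math

def pxcw(years_since_event):
    """Proximity of the last event: 1 at the event, halving every 10 years."""
    return 2.0 ** (-years_since_event / 10.0)

print(pxcw(0), pxcw(10), pxcw(20))   # 1.0 0.5 0.25

# Hazard multiplier right after an event (pxcw ~ 1), from the CASE 2 coef:
print(round(math.exp(0.9063), 3))    # 2.475
```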

CASE 3
Finally, I compared these results with those obtained without repeated events (which gives a smaller dataset): a country is removed as soon as its first event is observed.
(The robust se is still computed, even though the naive se should in fact be used here to compute the p-value.)

```
coxph(formula = Surv(start, stop, status) ~ cluster(ccode) + pop +
    pib + pib2 + crois + instab.x1 + instab.autres,
    data = xstep[no.previous.event, ])
```

```                 coef exp(coef) se(coef) robust se     z       p
pop            0.4236     1.528   0.1030    0.1157  3.66 2.5e-04
pib           -0.7821     0.457   0.2072    0.1931 -4.05 5.1e-05
pib2          -0.3069     0.736   0.1477    0.1254 -2.45 1.4e-02
crois         -0.0432     0.958   0.0281    0.0258 -1.67 9.5e-02
instab.x1      1.9925     7.334   0.5321    0.3578  5.57 2.6e-08
instab.autres  1.3571     3.885   0.5428    0.5623  2.41 1.6e-02

```

Likelihood ratio test=66.7 on 6 df, p=1.99e-12 n=5971 (2466 observations deleted due to missing)

```
> summary(lm(sqrt(diag(cox1$var)) ~ sqrt(diag(cox1$naive.var)) - 1))

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
sqrt(diag(cox1$naive.var))  0.86682    0.07826   11.08 0.000104 ***

Residual standard error: 0.06328 on 5 degrees of freedom
Multiple R-Squared: 0.9608,     Adjusted R-squared: 0.953
```

There is no evidence that the robust se differs more from the naive se in case 2 than in case 3 (or case 1); if anything, it is closer.

I conclude that conditional independence (martingaleness) cannot be rejected in CASE 2, when modeling the dependence between events with a covariate.
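As a side note, the BIC-based selection mentioned in the quoted exchange can be sketched directly from the likelihood-ratio statistics reported above, since for nested models the change in BIC from adding one parameter is log(n) minus the change in the LR statistic. A minimal illustration (assuming the conventional penalty log(n) with n = 7286; for Cox models some authors argue the number of events is the more appropriate n, which would lower the penalty):

```python
import math

n = 7286
lr_case1 = 74.0   # LR statistic, 6 df, without pxcw
lr_case2 = 78.3   # LR statistic, 7 df, with pxcw

# Adding pxcw improves 2*loglik by (78.3 - 74.0) at the cost of one
# extra parameter, penalized by log(n) under BIC.
delta_2loglik = lr_case2 - lr_case1
penalty = math.log(n)
delta_bic = penalty - delta_2loglik  # >0 means BIC favors the model without pxcw

print(round(delta_2loglik, 1), round(penalty, 2), round(delta_bic, 2))
```

Which model this favors depends on the choice of n in the penalty; the point here is only the mechanics of the comparison.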

Mayeul KAUFFMANN
Univ. Pierre Mendes France
Grenoble - France

> > Then, there is still another option. In fact, I already modelled
> > explicitly the influence of past events with a "proximity of last event"
> > covariate, assuming the dependence on the last event decreases at a
> > constant rate (for instance, the proximity covariate varies from 1 to
> > 0.5 in the first 10 years after an event, then from 0.5 to 0.25 in the
> > next ten years, etc.).
> >
> > With a well-chosen modelling of the dependence effect, the events
> > become conditionally independent, I do not need a +cluster(id) term,
> > and I can use fit$loglik to make a covariate selection based on BIC,
> > right?
>
> If you can get the conditional independence (martingaleness) then, yes,
> BIC is fine.
>
> One way to check might be to see how similar the standard errors are
> with and without the cluster(id) term.

R-help@stat.math.ethz.ch mailing list