From: Mayeul KAUFFMANN <mayeul.kauffmann_at_tiscali.fr>

Date: Mon 26 Jul 2004 - 22:15:42 EST


Hello everyone,

I am searching for a covariate selection procedure in a Cox model formulated as a counting process.

I use intervals; my formula looks like coxph(Surv(start,stop,status) ~ x1+x2+...+cluster(id), robust=TRUE), where id is a country code (I study the occurrence of civil wars from 1962 to 1997).
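For concreteness, here is a minimal reproducible sketch of this formulation on the Stanford heart transplant data that ships with the survival package (standing in for my country-level data):

```r
library(survival)

# Counting-process (start, stop] form of the Cox model, as in my formula;
# cluster(id) requests robust (sandwich) standard errors for the repeated
# intervals within one subject.
fit <- coxph(Surv(start, stop, event) ~ age + transplant + cluster(id),
             data = heart, robust = TRUE)
summary(fit)
```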
I'd like something not based on p-values, since they have several flaws for this purpose.

I turned to other criteria, but all the articles I have read seem to apply to the classical formulation of the Cox model, not the counting-process one (or they apply to both and I am not aware of it).

I've tried AIC with

> step(cox.fit)

or

> stepAIC(cox.fit)

and BIC using

> step(cox.fit, k = log(n))

but there seem to be two theoretical problems to address:

(1) These values are based on the partial log-likelihood ("loglik"). I wonder whether this is correct for the Cox model formulated as a *counting process*, with many (consecutive) observations for a given individual, some of them not independent.

Since "the likelihood ratio and score tests assume independence of observations within a cluster, the Wald and robust score tests do not" (R warning), and the likelihood ratio is based on loglik, can I use loglik in a BIC with some dependent observations?

[I have 170 individuals (namely, countries) over 36 years, some countries having up to 140 very short observation intervals, others (at the other extreme) having only 1 long interval per year. That is because I created artificial covariates measuring proximity in time to certain events: exp(-days.since.event/a.chosen.parameter). I split every yearly interval in which these covariates change rapidly (i.e. when the events are recent), yielding up to 11 intervals a year.]
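In R, such a proximity covariate is just the following (a sketch with an illustrative decay parameter of 100 days; the actual parameter is whatever one chooses):

```r
# Proximity-of-event covariate: equals 1 on the day of the event and
# decays exponentially with the time elapsed since it.
proximity <- function(days.since.event, a.chosen.parameter = 100) {
  exp(-days.since.event / a.chosen.parameter)
}

proximity(0)     # 1 on the day of the event
proximity(365)   # about 0.026 one year later
```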

(2) Which penalty term to use?

It seems natural to include the number of covariates, k. What about the number of observations?

I found several definitions:

AIC = -2 loglik(b) + 2k

Schwarz Bayesian information criterion:

SBIC = -2 loglik(b) + k ln(n)

Frédérique Letué (author of the PhD thesis "COX MODEL: ESTIMATION VIA MODEL SELECTION AND BIVARIATE SHOCK MODEL", http://www-lmc.imag.fr/lmc-sms/Frederique.Letue/These3.ps) suggested to me:

AIC = -loglik(b) + 2k/n

BIC = -loglik(b) + 2 ln(n) k/n, with other possible values for the parameter "2" in this case (see her thesis p. 100, but that section is in French)

None of these say *what to take for n*. There are three main possibilities:

a) Taking the number of observations (including censored ones) gives a huge n (around 6000 to 8000), which may seem meaningless since some observations are only a few days long. With n in the denominator (Letué's criteria), the penalty term would be so low that it is almost like having none:

> log(7000)/7000
[1] 0.001264809

(whereas loglik from summary(cox.fit) ranges from -155 to -175, depending on the model)

b) Volinsky & Raftery "propose a revision of the penalty term in BIC so that the penalty is defined in terms of the number of uncensored events instead of the number of observations" (Volinsky & Raftery, Bayesian Information Criterion for Censored Survival Models, June 16, 1999, http://www.research.att.com/~volinsky/papers/biocs.ps).

This could be computed with

> sum(coxph.detail(cox.interac.dsi6mois)$nevent)

Letué's BIC penalty term with 50 events would then be

> 2*log(50)/50
[1] 0.1564809

which would have more effect.
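Under their proposal the event count d, rather than the row count, would go into the penalty. A sketch on the survival package's heart data (standing in for mine; cluster() omitted here since it does not change the event count):

```r
library(survival)

fit <- coxph(Surv(start, stop, event) ~ age + transplant, data = heart)

# Total number of uncensored events, as in my coxph.detail() call:
d <- sum(coxph.detail(fit)$nevent)

# BIC-style stepwise selection penalised by log(d) instead of log(n):
# step(fit, k = log(d))
```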

However, adding or removing a country which has data for the 36 years but no event (and is therefore censored) will not change this BIC. Thus, it cannot account for missing data that do not reduce the number of events. I'd like the criterion to take this into account, because the covariates do not all have the same missing data.

The question is: when I have the choice between adding covariate x10 or x11, which have different (non-nested) sets of missing values, which one is best? Estimating all subsets of the full model (full model = all covariates) on a dataset containing no missing data for the full model would be a solution, but it would more than halve the dataset for many subsets of the covariates.

I should mention that step(cox.fit) gives a warning and stops:

"Error in step(cox.fit) : number of rows in use has changed: remove missing values?"

which makes me wonder whether the whole procedure is valid for models of different sample sizes.

c) "For discrete time event history analysis, the same choice has been
made,

while the total number of exposure time units has also

been used, for consistency with logistic regresion
(Raftery,Lewis,Aghajanian

and Kahn,1993;Raftery Lewisand Aghajanian,1994)"

(Raftery, Bayesian Model Selection in Social Research, 1994, http://www.stat.washington.edu/tech.reports/bic.ps)

I am not sure what "exposure time units" means. But since I could have used a logit model with yearly observations [though with many flaws...], I suggest I could use the number of years (the sum of interval lengths, in years):

> sum((fit$y)[,2] - (fit$y)[,1]) / 365.25
[1] 3759.537

This may still be too high.

Since I have data from 1962 to 1997 (36 years), I have the following number of "complete cases" equivalents:

> sum((fit$y)[,2] - (fit$y)[,1]) / (365.25*36)
[1] 104.4316

This seems more reasonable and would account for NAs in different models.

However, it might be too high, because some countries are not at risk during the whole period: some did not exist because they gained independence near the end of the period (e.g. ex-USSR countries in the early 1990s), or they were already experiencing an event (new wars in countries already at war are not taken into account). I may take the *proportion* of available data to time at risk to adjust for this: a country at risk during 1 year, with data available for that entire year, will increase n by 1, not by 1/36.

If the data frame "dataset" contains all countries at risk (including some with NAs), and assuming id == id[complete.cases(dataset)] (all countries have at least one complete observation), this can be computed by

> attach(dataset)
> sum(tapply(stop[complete.cases(dataset)] - start[complete.cases(dataset)], CCODE, sum)
      / tapply(stop - start, CCODE, sum))

But this would be a rather empirical adjustment, maybe with no theoretical basis. And I don't think I can enter this as argument k to step()....
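(As far as I can tell, the only way such an adjusted count could enter step() is through its per-parameter penalty k. A sketch, where n.eff is the hypothetical complete-case-weighted count of about 104 computed above:)

```r
# n.eff is NOT part of step()'s interface; it can only enter through
# the per-parameter penalty k = log(n.eff), giving a BIC-style penalty
# based on the adjusted effective sample size.
n.eff <- 104.4
log(n.eff)   # the multiplier that would be passed as k
# step(cox.fit, k = log(n.eff))
```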

Thank you for having read this; I hope it was not too long. And thanks a lot for any help, comments, etc.

(Sorry for any mistakes in English; I am not a native speaker.)

Mayeul KAUFFMANN

Univ. Pierre Mendes France

Grenoble - France

R-help@stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Mon Jul 26 22:25:13 2004

This archive was generated by hypermail 2.1.8 : Fri 18 Mar 2005 - 02:38:26 EST