Re: [R] A stats question -- about survival analysis and censoring

From: Geoff Russell <geoffrey.russell_at_gmail.com>
Date: Tue, 11 Mar 2008 21:17:04 +1030

Hi Max, Matthias and David and anyone else interested,

My question was prompted by seeing various prospective cohort studies which are deriving the RR of colorectal cancer due to smoking and which simply censor
people who die from diseases other then colorectal cancer and then use coxph to get an estimate of the CRC due to smoking. When (and only when) those diseases are also caused by smoking then I think this reduces the apparent risk ratio of crc due to smoking. The following R code summarises what I think is happening and why simply censoring these other deaths is wrong. Mind you I have absolutely no idea how these studies should proceed, its just that it looks wrong to me and I'm trying to find out what other people think.

Some might argue that dead people can't get crc, but consider a second case analogous to the first. Red meat and crc. If heart disease is censored then the RR of crc due to red meat is reduced. But as heart disease is reduced by better treatment, the crc "suddenly" emerges. The relationship was hidden by virtue of the censoring protocol.

Cheers,
Geoff.

R code follows

library(survival)
set.seed(2)

popsize<-3000
smoking<-popsize/2
nonsmoke<-popsize/2

#----------------------------------------------------------

# generate a bunch of times for people to be either
# censored or diagnosed

#----------------------------------------------------------
time<-c(rnorm(popsize,75,15))
#----------------------------------------------------------
# allocate people to smoking/nonsmoking groups
#----------------------------------------------------------
smokestatus<-c(rep(1,smoking),rep(0,nonsmoke))

#----------------------------------------------------------

# pick a colorectal cancer (crc) rate of 1 in 20 uniformly distributed
# Aussie lifetime rate is about 1 in 17

#----------------------------------------------------------
status<-c(sample(c(0,1),popsize,replace=T,prob=c(0.95,0.05)))
#----------------------------------------------------------
# alternatively generate a higher rate of crc in the smokers
#----------------------------------------------------------
statusAlt<-c(
 	sample(c(0,1),smoking,replace=T,prob=c(0.9,0.1)),
 	sample(c(0,1),nonsmoke,replace=T,prob=c(0.95,0.05))
 	)


#----------------------------------------------------------
# now generate some heart deaths with more in the smokers # The ratio of deaths from heart disease to crc is about 5
# to 1 in EPIC-oxford cohort, and our numbers below are similar.
#----------------------------------------------------------
 heartattack<-c(
 sample(c(0,1),smoking,replace=T,prob=c(0.7,0.3)),  sample(c(0,1),nonsmoke,replace=T,prob=c(0.85,0.15))  )

#----------------------------------------------------------

# people are only diagnosed with CRC if they don't
# die of heart disease --- on assumption that heart
# disease usually occurs before crc

#----------------------------------------------------------
statusHd<-as.integer(status&(!heartattack))

 statusAltHd<-as.integer(statusAlt&(!heartattack))

 data <- data.frame(time,status,smokestatus)  dataHd <- data.frame(time,statusHd,smokestatus)

 dataAlt <- data.frame(time,statusAlt,smokestatus)  dataAltHd <- data.frame(time,statusAltHd,smokestatus)

 cSmoke<-coxph(Surv(time,status)~smokestatus, data)  cHd<-coxph(Surv(time,statusHd)~smokestatus, dataHd)

 cAltSmoke<-coxph(Surv(time,statusAlt)~smokestatus, dataAlt)  cAltHd<-coxph(Surv(time,statusAltHd)~smokestatus, dataAltHd)
#----------------------------------------------------------
# as expected RR for crc of smokers relative to non-smokers is about 1
#----------------------------------------------------------
 summary(cSmoke)
Call:
coxph(formula = Surv(time, status) ~ smokestatus, data = data)

  n= 3000

               coef exp(coef) se(coef)      z    p
smokestatus -0.0221     0.978    0.149 -0.148 0.88

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus     0.978       1.02      0.73      1.31

Rsquare= 0 (max possible= 0.563 )

Likelihood ratio test= 0.02  on 1 df,   p=0.882
Wald test            = 0.02  on 1 df,   p=0.882
Score (logrank) test = 0.02  on 1 df,   p=0.882


#----------------------------------------------------------
# but the RR drops when heart disease takes out crc cases
#----------------------------------------------------------
summary(cHd)
Call:
coxph(formula = Surv(time, statusHd) ~ smokestatus, data = dataHd)

  n= 3000

              coef exp(coef) se(coef)      z    p
smokestatus -0.164     0.848    0.183 -0.898 0.37

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus     0.848       1.18     0.592      1.21

Rsquare= 0 (max possible= 0.424 )

Likelihood ratio test= 0.81  on 1 df,   p=0.369
Wald test            = 0.81  on 1 df,   p=0.369
Score (logrank) test = 0.81  on 1 df,   p=0.369



#----------------------------------------------------------
# when there really is a relationship
#----------------------------------------------------------
 summary(cAltSmoke)

Call:
coxph(formula = Surv(time, statusAlt) ~ smokestatus, data = dataAlt)

  n= 3000

             coef exp(coef) se(coef)    z     p
smokestatus 0.535      1.71    0.137 3.89 1e-04

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus      1.71      0.586      1.30      2.23

Rsquare= 0.005 (max possible= 0.659 )

Likelihood ratio test= 15.7  on 1 df,   p=7.59e-05
Wald test            = 15.1  on 1 df,   p=0.000101
Score (logrank) test = 15.5  on 1 df,   p=8.34e-05


#----------------------------------------------------------
# and the RR drops when heart disease is added
#----------------------------------------------------------
 summary(cAltHd)
Call:
coxph(formula = Surv(time, statusAltHd) ~ smokestatus, data = dataAltHd)

  n= 3000

             coef exp(coef) se(coef)    z     p
smokestatus 0.362      1.44    0.162 2.24 0.025

            exp(coef) exp(-coef) lower .95 upper .95
smokestatus      1.44      0.696      1.05      1.97

Rsquare= 0.002 (max possible= 0.526 )

Likelihood ratio test= 5.09  on 1 df,   p=0.024
Wald test            = 5.01  on 1 df,   p=0.0252
Score (logrank) test = 5.07  on 1 df,   p=0.0244

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Tue 11 Mar 2008 - 10:51:13 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 11 Mar 2008 - 11:30:21 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive