Re: [R] Plotting league tables/ caterpillar plots

From: Chris Evans <chris_at_psyctc.org>
Date: Tue 25 Jul 2006 - 01:39:35 EST

Eik Vettorazzi sent the following at 24/07/2006 13:04:
> Dear list,
> I was wondering if there is a function to plot league tables, sometimes
> also known as "caterpillar plots"?
> A league table is conceptually very similar to a box plot. One difference
> is that the inter-quartile ranges are not shown. If there isn't such a
> function a first attempt for a "selfmade" plot would be to tell boxplot
> not to plot boxes (sounds silly isn't it?). I've tried the option
> "boxwex=0" but the result is unsatisfactory.
>
> An example for a league table can be found in Marshall, Spiegelhalter
> [1998], Reliability of league tables of in vitro fertilisation clinics,
> BMJ1998;316:1701-1705, you may find it at http://bmj.bmjjournals.com

Interesting, never heard the term "caterpillar plot" but I like that. I also call related things "biplane plots": the body is the parameter and the wings (not really biplanes at all are they) are the 95% CI for the parameter. However, these caterpillar plots are rather like forest plots or "trees plots" (I argue that you can see both the wood and the trees!)

I tend to plot them rotated by 90 degrees and without such good labelling as in the Spiegelhalter paper, but this code may help you. The "traffic lights" in this superimpose overall quartiles which is used in some colleagues' way of looking at league tables. I've used similar plots for proportions, means, correlations and even alpha reliability parameters, it's fairly easy to substitute different parameters and their CIs. I prefer to use a more exact estimator of the CI of the median than the one that, if I remember rightly, is the default in the boxplot which is, I think a much quicker but less robust estimator of the CI for each sample.

I'd appreciate gentle constructive criticism of my coding, I know I'm a much better psychotherapist than a programmer!

C

-- 
Chris Evans <chris@psyctc.org>
Professor of Psychotherapy, Nottingham University;
Consultant Psychiatrist in Psychotherapy, Rampton Hospital;
Research Programmes Director, Nottinghamshire NHS Trust;
Hon. SL Institute of Psychiatry, Hon. Con., Tavistock & Portman Trust
**If I am writing from one of those roles, it will be clear. Otherwise**

**my views are my own and not representative of those institutions    **

## user controlled initialising title.txt <- paste("") dependent <- E1SESSPL indeped <- DBSITE ### special temporary addition for planned number of sessions dependent[dependent==0] <- NA missing.crit <- 50 # if the percentage of non-missing values for each level of the factor falls below this, colour changes lights.on <- 0 colour.not.in.final <- 1 # legend.pos <- "bottomright" # options are "topleft" or "bottomright" # legend.pos <- "topleft" # options are "topleft" or "bottomright" legend.pos <- "topright" # options are "topleft","topright" or "bottomright" sort.ord.flag <- "median" # options are "site","size","median" #sort.ord.flag <- "size" # options are "site","size","median" #sort.ord.flag <- "site" # options are "site","size","median" ## get the values #xlow <- min(indeped) #xhigh <- max(indeped) tmp.dat1 <- cbind(dependent,indeped) tmp.dat2 <- na.omit(tmp.dat1) tmp.val.dep <- tmp.dat2[,1] tmp.val.indep <- tmp.dat2[,2] tmp.n <- table(tmp.val.indep) site.names <- as.numeric(names(tmp.n)) xlen <- length(site.names) xlow <- 1 xhigh <- xlen print(xlen) ## OK deal with, and count, missings print(tmp.n) print(sum(tmp.n)) grand.n <- length(indeped) usable.n <- length(tmp.val.dep) n.missing <- grand.n - usable.n perc.missing <- round(100*round(n.missing/grand.n,2),2) grand.n.site <- table(indeped) miss.triggered <- 0 #ord.site <- order(as.numeric(names(tmp.n))) ## set up storage lwr <- rep(0,xlen) skip.site <- lwr site.median <- lwr upr <- lwr n.site <- lwr perc.site <- lwr prop <- lwr n.sig.low <- 0 n.sig.high <- 0 ## need to find limits yhigh <- max(tmp.val.dep) ylow <- min(tmp.val.dep) #yhigh <- 4 #ylow <- 0 ## now use it ref.median <- median(tmp.val.dep) overall.max <- round(max(tmp.val.dep),2) overall.min <- round(min(tmp.val.dep),2) for (i in 1:xlen) { tmpval.site <- site.names[i] tmp.site.dat <- tmp.val.dep[tmp.val.indep==tmpval.site] n.site[i] <- length(tmp.site.dat) perc.site[i] <- 100*n.site[i]/grand.n.site[i] if (n.site[i] <= 5) { prop[i] <- -1 ##### impossible prop < 0 useful for sort order and to omit these ##### carried over from binconf version of this } else { tmp.ci.med <- conf.med(tmp.site.dat) if (is.na(tmp.ci.med$lower) | is.na(tmp.ci.med$upper)) { prop[i] <- 1 } else { lwr[i] <- tmp.ci.med$lower upr[i] <- tmp.ci.med$upper site.median[i] <- tmp.ci.med$median if (lwr[i] > ref.median) n.sig.high <- n.sig.high + 1 # increment counter for high scoring service if (upr[i] < ref.median) n.sig.low <- n.sig.low + 1 # ditto for low } } } min.median <- signif(min(site.median),3) max.median <- signif(max(site.median),3) #yhigh <- max(upr) #ylow <- min(lwr) ## use those to set the extreme CIs where these weren't calculable upr[prop==-1] <- yhigh lwr[prop==-1] <- ylow ## end of doing that! ylen <- yhigh-ylow yinc <- ylen/100 ### *** this scale handling really needs rewriting *** ### ## adjusting block ## # set up increments for text etc. ylen <- yhigh - ylow xinc <- 1 # might want to toy with this yinc <- ylen/100 if (legend.pos == "topleft") { legpos.x <- xlow ; legpos.y <- yhigh } if (legend.pos == "bottomright") { legpos.x <- xhigh - 15*xinc ; legpos.y <- ylow + 30*yinc } if (legend.pos == "topright") { legpos.x <- xhigh - 15*xinc ; legpos.y <- yhigh } ### *** end of scale handling block *** ### ### plotting block ### par(adj=.5, col=1, pch=1, lwd=1, lty=1) # just resetting the graphic parameters plot(upr,xlim=c(xlow,xhigh),ylim=c(ylow,yhigh),ylab="Site median",xaxt="n",xlab="Services",type="n") title(title.txt) ## next section puts in the horizontal ablines and traffic light coding if that's required if (lights.on) { # add the "traffic light" areas t.lights <- (quantile(site.median)) # gets the five quantile summary for the traffic light boundaries polygon(c(xlow-1,xlow-1,xhigh+1,xhigh+1),c(t.lights[1],t.lights[2],t.lights[2],t.lights[1]),col=8) polygon(c(xlow-1,xlow-1,xhigh+1,xhigh+1),c(t.lights[2],t.lights[3],t.lights[3],t.lights[2]),col=5) polygon(c(xlow-1,xlow-1,xhigh+1,xhigh+1),c(t.lights[3],t.lights[4],t.lights[4],t.lights[3]),col=12) polygon(c(xlow-1,xlow-1,xhigh+1,xhigh+1),c(t.lights[4],t.lights[5],t.lights[5],t.lights[4]),col=4) abline(h=t.lights) } else { abline(h=min.median) abline(h=max.median) } # put in the overall proportion # (which isn't the median of the sites' proportions and rarely would be) par(lwd=2) if (lights.on) par(lty=2) abline(h=ref.median) if (lights.on) par(lty=1) par(lwd=1) ## end of ablines and traffic lights ## ## sort order if (sort.ord.flag == "site") sort.ord <- ord.site if (sort.ord.flag == "size") sort.ord <- order(apply(tmp.n,2,sum)) if (sort.ord.flag == "median") sort.ord <- order(site.median) ## plot the data for each level of the independent factor for (i in 1:xlen) { ## depending on the data available, format the plotting for this site if (prop[sort.ord][i] >= 0) { # n for the site was > 1 if (perc.site[i] > missing.crit){ # and enough non-missing data par(col=1) } else { par(col=8) # n large enough but too many missings miss.triggered <- 1 if (colour.not.in.final) par(lty=2) } } else { par(lty=2) # n was tiny par(col=8) } lines(c(i,i),c(lwr[sort.ord][i],upr[sort.ord][i])) # the CI line par(pch=22) # CI limits points(i,lwr[sort.ord][i]) points(i,upr[sort.ord][i]) par(lwd=4) points(i,site.median[sort.ord][i]) # the median par(pch=1,lwd=1) par(crt=90,adj=0,cex=.75,col=8) text(i,upr[sort.ord][i]+1*yinc,site.names[sort.ord][i]) par(col=6,crt=0,adj=.5) if (add.n) { # add the n if required text(i,lwr[sort.ord][i]-2*yinc,n.site[sort.ord][i]) } par(crt=0,adj=.5,cex=1,col=1,lty=1,lwd=1) # reset the graphic formatting } if (min.median != 0) ratio <- as.character(signif(max.median/min.median,3)) else ratio <- "Inf." ref.median <- round(ref.median,2) tmp.kw <- unlist(kruskal.test(dependent,indeped)) # one day I'll remember how to paste variable names! kw.chisq <- round(as.numeric(tmp.kw[1]),2) kw.df <- as.numeric(tmp.kw[2]) kw.p <- round(as.numeric(tmp.kw[3]),4) if (!lights.on) { par(cex=.75,adj=0) tmptxt <- paste("Overall median = ",ref.median," on range from ",overall.min," to ",overall.max,sep="") text(legpos.x,legpos.y-2*yinc,tmptxt) tmptxt <- paste("From n = ",usable.n," n(miss) = ",n.missing," %(miss) = ",perc.missing,sep="") text(legpos.x,legpos.y-5*yinc,tmptxt) tmptxt <- paste("Highest median = ",max.median," lowest = ",min.median,sep="") text(legpos.x,legpos.y-8*yinc,tmptxt) tmptxt <- paste("Ratio, max:min = ",ratio,sep="") text(legpos.x,legpos.y-11*yinc,tmptxt) tmptxt <- paste("Kruskal-Wallis chi square = ",kw.chisq," d.f. = ",kw.df," p = ",kw.p) text(legpos.x,legpos.y-14*yinc,tmptxt) tmptxt <- paste("Number \"significantly\" high",n.sig.high) text(legpos.x+xinc,legpos.y-17*yinc,tmptxt) tmptxt <- paste("Number \"significantly\" low",n.sig.low) text(legpos.x+xinc,legpos.y-20*yinc,tmptxt) tmp <- n.sig.low+n.sig.high tmptxt <- paste("Number \"significantly\" different",tmp) text(legpos.x+xinc,legpos.y-23*yinc,tmptxt) if (miss.triggered) { tmp <- round(100 - missing.crit,2) if (colour.not.in.final) { tmptxt <- paste("Dotted CI indicates %(miss) > ",tmp,"%",sep="") text(legpos.x,legpos.y-26*yinc,tmptxt) } else { tmptxt <- paste("Dotted red CI indicates %(miss) > ",tmp,"%",sep="") text(legpos.x,legpos.y-26*yinc,tmptxt) } } } else { # graph too cluttered for all that if doing traffic lights so just # list the traffic light boundaries tmptxt1 <- paste("Quantile boundaries at: ") tmptxt2 <- paste(round(t.lights, 1),collapse=", ") tmptxt <- paste(tmptxt1,tmptxt2) text(legpos.x,legpos.y-24*yinc,tmptxt) } par(adj=.5, col=1, pch=1, lwd=1, lty=1) # just resetting the graphic parameters

conf.med <- function(x) { # Luc Wouters (wouters@janbe.jnj.com) sent me # the following function: # ----------Message from Luc Wouters------------------------- # A distribution-free confidence interval on the median assuming only iid # is based on the order statistics and the binomial distribution. See # Lehmann (Nonparametrics: Statistical Methods Based on Ranks, # Holden-Day, 1975, p.182-183). # The following S-Plus function can be used to set up such a 95% # distribution-free confidence interval on the median: v <- sort(x, na.last = NA) n <- length(x) if(n > 0) { m <- median(x) i <- qbinom(0.025, n, 0.5) if(i > 0) r <- c(m, v[i], v[n - i + 1]) else r <- c(m, NA, NA) } else r <- c(NA, NA, NA) r <- as.data.frame(list(median = r[1], lower = r[2], upper = r[3])) class(r) <- "table" r } #conf.med(na.omit(bdi1$CORE14A)) tmp <- read.table("d:\\Data\\TCs\\ATC_Quality_Assurance_Panel\\Data3\\SP2000\\BDI1CORE14A.txt",as.is=TRUE) conf.med(na.omit(tmp$V1))

______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Tue Jul 25 04:25:55 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 25 Jul 2006 - 06:22:25 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.