[R] Aggregation of data frame with calculations of proportions

From: Mark Wardle <mark_at_wardle.org>
Date: Tue, 26 Jun 2007 20:14:00 +0100


Dear all,

I have been stuck on this problem, am rather struggling and would appreciate some advice if anyone can help. I apologise if this is a bit long-winded, but I've tried to limit it to the bare essentials, but don't know how to make it more generic!

I have some slightly odd real world data that I'm looking at representing number of positive diagnoses for different diseases, plus the number tested. It's all in a data frame:

This should all be directly runnable with cut+paste:

sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]

str(sp)

#'data.frame':   46 obs. of  19 variables:

# $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ...
# $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ...
# $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ...
# $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ...
# $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ...
# $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ...
# $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ...
# $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ...
# $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ...
# $ frda : int NA NA NA NA 0 1 NA NA NA NA ...
# $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ...
# $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ...
# $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ...
# $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ...
# $ total : int 26 31 416 393 183 43 143 97 29 22 ...
# $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2
4 8 3 2 2 ...
# $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1 1 ...
# $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 8 ...
# $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3
3 3 3 3 ...

# It is straightforward to aggregate data:

smart.sum <- function(x,...) {

	if (is.numeric(x)) return(sum(x, na.rm=T))
	else return(paste(unique(x), collapse=", "))		# for natbib citations
}
sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum) sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum) print(sp.ag1)

# and even to calculate crude percentages
spp = sp.ag1
spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, 1)})

# *BUT*, the problem is that this is underestimating the true
proportions, because some of the data is NA. This means the diagnosis was not tested rather than not found (which would be zero).

# What I want to do is calculate a true proportion, with the number
found divided by the number tested.
# I have managed to do this:

# calculates true proportion by only including values in the
denominator that have non-NAs in the numerator
# x= vector of numbers
# total = vector of numbers

calc.prevalence <- function(x, total) {
	sum.x = sum(x, na.rm=T)		# calculate numerator
	sum.y = sum(total[!is.na(x)])		# calculate denominator
	return(sum.x/sum.y)

}

correct.prevalence <- calc.prevalence(sp$drpla, sp$total) incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T) print(correct.prevalence)
print(incorrect.prevalence)

This is easy to apply to one column in one table, but I'm finding it very difficult to do when I manipulate the data using the aggregate() function above.

Is there a straightforward way, while aggregating columns by a variable (number of) factors, in generating the sum of a column and dividing by the sum of another column, only including data from the second column when the first column is not NA. AND is it possible to do that to all of the relevant columns (3:15)?

You won't believe how many iterations of by(), aggregate(), tapply(), apply() and (finally) "for loops" I have tried with no success. The best partial solution involved a combination of by() calling a function calling aggregate(), but I can't parse the data returned. I'm sure I am missing something! Can anyone help?

Many (many many) thanks!

Best wishes,

Mark

P.S. this is enough to drive me to drink!

-- 
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK

______________________________________________
R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 26 Jun 2007 - 19:49:31 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 27 Jun 2007 - 00:32:35 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.