Re: [R] Aggregation of data frame with calculations of proportions

From: jim holtman <jholtman_at_gmail.com>
Date: Tue, 26 Jun 2007 18:42:49 -0400

I think something like this will do it for you.

set.seed(1)
n <- 10
x <- data.frame(a=sample(1:100,n), b=sample(1:100,n),d=sample(1:100,n))

x$a[c(1,5)] <- NA
x$b[c(7,6,4)] <- NA
x$d[c(5,8)] <- NA
x$total <- rowSums(x, na.rm=TRUE)
x$type <- sample(1:3, n, replace=TRUE)

by(x, list(x$type), function(z){

    apply(z[,1:3], 2, calc.prevalence, total=z$total) })

On 6/26/07, Mark Wardle <mark_at_wardle.org> wrote:
>

> Dear all,
>
> I have been stuck on this problem, am rather struggling and would
> appreciate some advice if anyone can help. I apologise if this is a
> bit long-winded, but I've tried to limit it to the bare essentials,
> but don't know how to make it more generic!
>
> I have some slightly odd real world data that I'm looking at
> representing number of positive diagnoses for different diseases, plus
> the number tested. It's all in a data frame:
>
> This should all be directly runnable with cut+paste:
>
> sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]
>
> str(sp)
>
> #'data.frame': 46 obs. of 19 variables:
> # $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ...
> # $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ...
> # $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ...
> # $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ...
> # $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ...
> # $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ...
> # $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ...
> # $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ...
> # $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ...
> # $ frda : int NA NA NA NA 0 1 NA NA NA NA ...
> # $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ...
> # $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ...
> # $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ...
> # $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ...
> # $ total : int 26 31 416 393 183 43 143 97 29 22 ...
> # $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2
> 4 8 3 2 2 ...
> # $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1
> 1 ...
> # $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8
> 8 ...
> # $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3
> 3 3 3 3 ...
>
> # It is straightforward to aggregate data:
>
> smart.sum <- function(x,...) {
> if (is.numeric(x)) return(sum(x, na.rm=T))
> else return(paste(unique(x), collapse=", ")) # for
> natbib citations
> }
> sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum)
> sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=
> smart.sum)
> print(sp.ag1)
>
> # and even to calculate crude percentages
> spp = sp.ag1
> spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total,
> 1)})
>
> # *BUT*, the problem is that this is underestimating the true
> proportions, because some of the data is NA. This means the diagnosis
> was not tested rather than not found (which would be zero).
>
> # What I want to do is calculate a true proportion, with the number
> found divided by the number tested.
> # I have managed to do this:
>
> # calculates true proportion by only including values in the
> denominator that have non-NAs in the numerator
> # x= vector of numbers
> # total = vector of numbers
> calc.prevalence <- function(x, total) {
> sum.x = sum(x, na.rm=T) # calculate numerator
> sum.y = sum(total[!is.na(x)]) # calculate denominator
> return(sum.x/sum.y)
> }
>
> correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
> incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
> print(correct.prevalence)
> print(incorrect.prevalence)
>
>
> This is easy to apply to one column in one table, but I'm finding it
> very difficult to do when I manipulate the data using the aggregate()
> function above.
>
> Is there a straightforward way, while aggregating columns by a
> variable (number of) factors, in generating the sum of a column and
> dividing by the sum of another column, only including data from the
> second column when the first column is not NA. AND is it possible to
> do that to all of the relevant columns (3:15)?
>
> You won't believe how many iterations of by(), aggregate(), tapply(),
> apply() and (finally) "for loops" I have tried with no success. The
> best partial solution involved a combination of by() calling a
> function calling aggregate(), but I can't parse the data returned. I'm
> sure I am missing something! Can anyone help?
>
> Many (many many) thanks!
>
> Best wishes,
>
> Mark
>
> P.S. this is enough to drive me to drink!
> --
> Dr. Mark Wardle
> Clinical research fellow and specialist registrar, Neurology
> Cardiff, UK
>
> ______________________________________________
> R-help_at_stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

	[[alternative HTML version deleted]]

______________________________________________
R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 26 Jun 2007 - 22:46:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 27 Jun 2007 - 08:32:33 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.