From: jim holtman <jholtman_at_gmail.com>

Date: Tue, 26 Jun 2007 18:42:49 -0400

by(x, list(x$type), function(z){

Date: Tue, 26 Jun 2007 18:42:49 -0400

I think something like this will do it for you.

set.seed(1)

n <- 10

x <- data.frame(a=sample(1:100,n), b=sample(1:100,n),d=sample(1:100,n))

x$a[c(1,5)] <- NA x$b[c(7,6,4)] <- NA x$d[c(5,8)] <- NA x$total <- rowSums(x, na.rm=TRUE) x$type <- sample(1:3, n, replace=TRUE)

by(x, list(x$type), function(z){

apply(z[,1:3], 2, calc.prevalence, total=z$total) })

On 6/26/07, Mark Wardle <mark_at_wardle.org> wrote:

*>
*

> Dear all,

*>
**> I have been stuck on this problem, am rather struggling and would
**> appreciate some advice if anyone can help. I apologise if this is a
**> bit long-winded, but I've tried to limit it to the bare essentials,
**> but don't know how to make it more generic!
**>
**> I have some slightly odd real world data that I'm looking at
**> representing number of positive diagnoses for different diseases, plus
**> the number tested. It's all in a data frame:
**>
**> This should all be directly runnable with cut+paste:
**>
**> sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]
**>
**> str(sp)
**>
**> #'data.frame': 46 obs. of 19 variables:
**> # $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ...
**> # $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ...
**> # $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ...
**> # $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ...
**> # $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ...
**> # $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ...
**> # $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ...
**> # $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ...
**> # $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ...
**> # $ frda : int NA NA NA NA 0 1 NA NA NA NA ...
**> # $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ...
**> # $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ...
**> # $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ...
**> # $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ...
**> # $ total : int 26 31 416 393 183 43 143 97 29 22 ...
**> # $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2
**> 4 8 3 2 2 ...
**> # $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1
**> 1 ...
**> # $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8
**> 8 ...
**> # $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3
**> 3 3 3 3 ...
**>
**> # It is straightforward to aggregate data:
**>
**> smart.sum <- function(x,...) {
**> if (is.numeric(x)) return(sum(x, na.rm=T))
**> else return(paste(unique(x), collapse=", ")) # for
**> natbib citations
**> }
**> sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum)
**> sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=
**> smart.sum)
**> print(sp.ag1)
**>
**> # and even to calculate crude percentages
**> spp = sp.ag1
**> spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total,
**> 1)})
**>
**> # *BUT*, the problem is that this is underestimating the true
**> proportions, because some of the data is NA. This means the diagnosis
**> was not tested rather than not found (which would be zero).
**>
**> # What I want to do is calculate a true proportion, with the number
**> found divided by the number tested.
**> # I have managed to do this:
**>
**> # calculates true proportion by only including values in the
**> denominator that have non-NAs in the numerator
**> # x= vector of numbers
**> # total = vector of numbers
**> calc.prevalence <- function(x, total) {
**> sum.x = sum(x, na.rm=T) # calculate numerator
**> sum.y = sum(total[!is.na(x)]) # calculate denominator
**> return(sum.x/sum.y)
**> }
**>
**> correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
**> incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
**> print(correct.prevalence)
**> print(incorrect.prevalence)
**>
**>
**> This is easy to apply to one column in one table, but I'm finding it
**> very difficult to do when I manipulate the data using the aggregate()
**> function above.
**>
**> Is there a straightforward way, while aggregating columns by a
**> variable (number of) factors, in generating the sum of a column and
**> dividing by the sum of another column, only including data from the
**> second column when the first column is not NA. AND is it possible to
**> do that to all of the relevant columns (3:15)?
**>
**> You won't believe how many iterations of by(), aggregate(), tapply(),
**> apply() and (finally) "for loops" I have tried with no success. The
**> best partial solution involved a combination of by() calling a
**> function calling aggregate(), but I can't parse the data returned. I'm
**> sure I am missing something! Can anyone help?
**>
**> Many (many many) thanks!
**>
**> Best wishes,
**>
**> Mark
**>
**> P.S. this is enough to drive me to drink!
**> --
**> Dr. Mark Wardle
**> Clinical research fellow and specialist registrar, Neurology
**> Cardiff, UK
**>
**> ______________________________________________
**> R-help_at_stat.math.ethz.ch mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-help
**> PLEASE do read the posting guide
**> http://www.R-project.org/posting-guide.html
**> and provide commented, minimal, self-contained, reproducible code.
**>
*

-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]] ______________________________________________ R-help_at_stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.Received on Tue 26 Jun 2007 - 22:46:22 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Wed 27 Jun 2007 - 08:32:33 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*