From: Mark Wardle <mark_at_wardle.org>

Date: Tue, 26 Jun 2007 20:14:00 +0100

4 8 3 2 2 ...

*# $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1 1 ...
*

# $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 8 ...

# $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3

3 3 3 3 ...

sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum) sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum) print(sp.ag1)

}

Date: Tue, 26 Jun 2007 20:14:00 +0100

This should all be directly runnable with cut+paste:

sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]

str(sp)

#'data.frame': 46 obs. of 19 variables:# $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2

# $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ...# $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ...# $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ...# $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ...# $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ...# $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ...# $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ...# $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ...# $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ...# $ frda : int NA NA NA NA 0 1 NA NA NA NA ...# $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ...# $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ...# $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ...# $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ...

# $ total : int 26 31 416 393 183 43 143 97 29 22 ...

4 8 3 2 2 ...

# $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 8 ...

# $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3

3 3 3 3 ...

# It is straightforward to aggregate data:

smart.sum <- function(x,...) {

if (is.numeric(x)) return(sum(x, na.rm=T)) else return(paste(unique(x), collapse=", ")) # for natbib citations}

sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum) sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum) print(sp.ag1)

# and even to calculate crude percentages

spp = sp.ag1

spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, 1)})

# *BUT*, the problem is that this is underestimating the true

proportions, because some of the data is NA. This means the diagnosis
was not tested rather than not found (which would be zero).

# What I want to do is calculate a true proportion, with the number

found divided by the number tested.

# I have managed to do this:

# calculates true proportion by only including values in the

denominator that have non-NAs in the numerator

# x= vector of numbers

# total = vector of numbers

calc.prevalence <- function(x, total) { sum.x = sum(x, na.rm=T) # calculate numerator sum.y = sum(total[!is.na(x)]) # calculate denominator return(sum.x/sum.y)

}

correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
print(correct.prevalence)

print(incorrect.prevalence)

This is easy to apply to one column in one table, but I'm finding it very difficult to do when I manipulate the data using the aggregate() function above.

Is there a straightforward way, while aggregating columns by a variable (number of) factors, in generating the sum of a column and dividing by the sum of another column, only including data from the second column when the first column is not NA. AND is it possible to do that to all of the relevant columns (3:15)?

You won't believe how many iterations of by(), aggregate(), tapply(), apply() and (finally) "for loops" I have tried with no success. The best partial solution involved a combination of by() calling a function calling aggregate(), but I can't parse the data returned. I'm sure I am missing something! Can anyone help?

Many (many many) thanks!

Best wishes,

Mark

P.S. this is enough to drive me to drink!

-- Dr. Mark Wardle Clinical research fellow and specialist registrar, Neurology Cardiff, UK ______________________________________________ R-help_at_stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.Received on Tue 26 Jun 2007 - 19:49:31 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Wed 27 Jun 2007 - 00:32:35 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*