Re: [R] Replacing for loop with tapply!?

From: Sander Oom <slist_at_oomvanlieshout.net>
Date: Sun 12 Jun 2005 - 22:37:12 EST

Your solution (the second function) is definitely the most elegant and generic solution of all replies in this discussion. Robust for missing values and flexible to allow as many calculations as desired! It is so clear, I even managed to hack it (of course also thanks to the new insight from all the other posts)!

As the data consists of weather stations in rows and days in columns, I have adapted the function to work on rows instead of columns. Did not manage to get the results directly into the right rows/cols layout, so a transpose (t) is still required. However this seems instant, so does not mean a reduction in speed! Calculating proportions is now a snip!!

Thanks for you help,

Sander.

### simulate data
set.seed(1) # for reproducibility mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10) mat[ mat > 45 ] <- NA # create some missing values mat[ 9, ] <- NA # station 9's data is completely missing mat

find.stats <- function( data, threshold ){

```   n      <- length(threshold)
excess <- numeric( n )
out    <- matrix( ncol=nrow(data), nrow=(n + 2) ) # initialise
good   <- which( apply( data, 1, function(x) !all(is.na(x)) ) )
```
# rows that are not completely missing

out[ ,good ] <- apply( data[ good, ], 1, function(x){

```     m <- max( x, na.rm=T )
# determine maximum value per row
c <- length(x[!is.na(x)])
# determine number of non-missing values
for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE
)/length(x[!is.na(x)]) }
# calc proportion of non-missing values over multiple thresholds
return( c(m, c, excess) )
```

} )

rownames(out) <- c( "TmpMax", "Count", paste("Over", threshold, sep="") )    colnames(out) <- rownames(data) # name of the stations    return( t(out) )
}

lstTemps=c(37,39,41,43)
tmp <- find.stats( mat, lstTemps )
tmp

> OK, so you want to find some summary statistics for each column, where
> some columns could be completely missing.
>
> Writing a small wrapper should help. When you use apply(), you are
> actually applying a function to every column (or row). First, let us
> simulate a dataset with 15 days/rows and 10 stations/columns
>
> ### simulate data
> set.seed(1) # for reproducibility
> mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
> mat[ mat > 45 ] <- NA # create some missing values
> mat[ ,9 ] <- NA # station 9's data is completely missing
>
>
> Here are two example of such wrappers :
>
> find.stats1 <- function( data, threshold=c(37,39,41) ){
>
> n <- length(threshold)
> out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
>
> out[1, ] <- apply(data, 2, function(x)
> ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))
>
> for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )
>
> rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
> colnames(out) <- rownames(data) # name of the stations
> return( out )
> }
>
> find.stats2 <- function( data, threshold=c(37,39,41) ){
>
> n <- length(threshold)
> excess <- numeric( n )
> out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
> good <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
> # colums that are not completely missing
>
> out[ , good] <- apply( data[ , good], 2, function(x){
> m <- max( x, na.rm=T )
> for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
> return( c(m, excess) )
> } )
>
> rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
> colnames(out) <- rownames(data) # name of the stations
> return( out )
> }
>
> find.stats1( mat )
> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> daily_max 44 42 39 41 45 43 42 45 NA 42
> above_37 2 1 2 1 3 2 2 1 0 1
> above_39 2 1 0 1 3 2 1 1 0 1
> above_41 2 1 0 0 2 2 1 1 0 1
>
> find.stats2( mat )
> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> daily_max 44 42 39 41 45 43 42 45 NA 42
> above_37 2 1 2 1 3 2 2 1 NA 1
> above_39 2 1 0 1 3 2 1 1 NA 1
> above_41 2 1 0 0 2 2 1 1 NA 1
>
>
> On my laptop 'find.stats1' and 'find.stats2' (which is more flexible)
> takes 7 and 6 seconds respectively to execute on a dataset with 10000
> stations and 365 days.
>
>
>
>
> On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:

```>>Dear all,
>>
>>Dimitris and Andy, thanks for your great help. I have progressed to the
>>following code which runs very fast and effective:
>>
>>mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
>>mat[mat>45] <- NA
>>mat<-NA
>>mat
>>temps <- c(35, 37, 39)
>>ind <- rbind(
>>     t(sapply(temps, function(temp)
>>       rowSums(mat > temp, na.rm=TRUE) )),
>>     rowSums(!is.na(mat), na.rm=FALSE),
>>     apply(mat, 1, max, na.rm=TRUE))
>>ind <- t(ind)
>>ind
>>
>>However, some weather stations have missing values for the whole year.
>>Unfortunately, the code breaks down (when uncommenting mat<-NA).
>>
>>I have tried 'ifelse' statements in the functions, but it becomes even
>>more of a mess. I could subset the matrix before hand, but this would
>>mean merging with a complete matrix afterwards to make it compatible
>>with other years. That would slow things down.
>>
>>How can I make the code robust for rows containing all missing values?
>>
>>
>>Sander.
>>
>>Dimitris Rizopoulos wrote:
>>>for the maximum you could use something like:
>>>
>>>ind[, 1] <- apply(mat, 2, max)
>>>
>>>I hope it helps.
>>>
>>>Best,
>>>Dimitris
>>>
>>>----
>>>Dimitris Rizopoulos
>>>Ph.D. Student
>>>Biostatistical Centre
>>>School of Public Health
>>>Catholic University of Leuven
>>>
>>>Tel: +32/16/336899
>>>Fax: +32/16/337015
>>>Web: http://www.med.kuleuven.ac.be/biostat/
>>>     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
>>>
>>>
>>>
>>>----- Original Message -----
>>>From: "Sander Oom" <slist@oomvanlieshout.net>
>>>To: "Dimitris Rizopoulos" <dimitris.rizopoulos@med.kuleuven.be>
>>>Cc: <r-help@stat.math.ethz.ch>
>>>Sent: Friday, June 10, 2005 12:10 PM
>>>Subject: Re: [R] Replacing for loop with tapply!?
>>>
>>>
>>>>Thanks Dimitris,
>>>>
>>>>Very impressive! Much faster than before.
>>>>
>>>>Thanks to new found R.basic, I can simply rotate the result with
>>>>rotate270{R.basic}:
>>>>
>>>>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>>>>>temps <- c(37, 39, 41)
>>>>>#################
>>>>>#ind <- matrix(0, length(temps), ncol(mat))
>>>>>ind <- matrix(0, 4, ncol(mat))
>>>>>(startDate <- date())
>>>>[1] "Fri Jun 10 12:08:01 2005"
>>>>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>>>>>ind[4, ] <- colMeans(max(mat))
>>>>Error in colMeans(max(mat)) : 'x' must be an array of at least two
>>>>dimensions
>>>>>(endDate <- date())
>>>>[1] "Fri Jun 10 12:08:02 2005"
>>>>>ind <- rotate270(ind)
>>>>>ind[1:10,]
>>>>  V4 V3 V2 V1
>>>>1   0 56 75 80
>>>>2   0 46 53 60
>>>>3   0 50 58 67
>>>>4   0 60 72 80
>>>>5   0 59 68 76
>>>>6   0 55 67 74
>>>>7   0 62 77 93
>>>>8   0 45 57 67
>>>>9   0 57 68 75
>>>>10  0 61 66 76
>>>>
>>>>However, I have not managed to get the row maximum using your
>>>>method? It
>>>>should be 50 for most rows, but my first guess code gives an error!
>>>>
>>>>Any suggestions?
>>>>
>>>>Sander
>>>>
>>>>
>>>>
>>>>Dimitris Rizopoulos wrote:
>>>>>maybe you are looking for something along these lines:
>>>>>
>>>>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>>>>>temps <- c(37, 39, 41)
>>>>>#################
>>>>>ind <- matrix(0, length(temps), ncol(mat))
>>>>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>>>>>ind
>>>>>
>>>>>
>>>>>I hope it helps.
>>>>>
>>>>>Best,
>>>>>Dimitris
>>>>>
>>>>>----
>>>>>Dimitris Rizopoulos
>>>>>Ph.D. Student
>>>>>Biostatistical Centre
>>>>>School of Public Health
>>>>>Catholic University of Leuven
>>>>>
>>>>>Tel: +32/16/336899
>>>>>Fax: +32/16/337015
>>>>>Web: http://www.med.kuleuven.ac.be/biostat/
>>>>>    http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
>>>>>
>>>>>
>>>>>----- Original Message -----
>>>>>From: "Sander Oom" <slist@oomvanlieshout.net>
>>>>>To: <r-help@stat.math.ethz.ch>
>>>>>Sent: Friday, June 10, 2005 10:50 AM
>>>>>Subject: [R] Replacing for loop with tapply!?
>>>>>
>>>>>
>>>>>>Dear all,
>>>>>>
>>>>>>We have a large data set with temperature data for weather stations
>>>>>>across the globe (15000 stations).
>>>>>>
>>>>>>For each station, we need to calculate the number of days a certain
>>>>>>temperature is exceeded.
>>>>>>
>>>>>>So far we used the following S code, where mat88 is a matrix
>>>>>>containing
>>>>>>rows of 365 daily temperatures for each of 15000 weather stations:
>>>>>>
>>>>>>m <- 37
>>>>>>n <- 2
>>>>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
>>>>>>for(i in 1:nrow(mat88)) {
>>>>>># i <- 3
>>>>>>row1 <- as.data.frame(df88[i,  ])
>>>>>>temprow37 <- select.rows(row1, row1 > m)
>>>>>>temprow39 <- select.rows(row1, row1 > m + n)
>>>>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
>>>>>>outmat88[i, 1] <- max(row1, na.rm = T)
>>>>>>outmat88[i, 2] <- count.rows(temprow37)
>>>>>>outmat88[i, 3] <- count.rows(temprow39)
>>>>>>outmat88[i, 4] <- count.rows(temprow41)
>>>>>>}
>>>>>>outmat88
>>>>>>
>>>>>>We have transferred the data to a more potent Linux box running R,
>>>>>>but
>>>>>>still hope to speed up the code.
>>>>>>
>>>>>>I know a for loop should be avoided when looking for speed. I also
>>>>>>know
>>>>>>the answer is in something like tapply, but my understanding of
>>>>>>these
>>>>>>commands is still to limited to see the solution. Could someone
>>>>>>show
>>>>>>me
>>>>>>the way!?
>>>>>>