Re: [R] Replacing for loop with tapply!?

From: Adaikalavan Ramasamy <ramasamy_at_cancer.org.uk>
Date: Sat 11 Jun 2005 - 10:36:43 EST

OK, so you want to find some summary statistics for each column, where some columns could be completely missing.

Writing a small wrapper should help. When you use apply(), you are actually applying a function to every column (or row). First, let us simulate a dataset with 15 days/rows and 10 stations/columns

### simulate data
set.seed(1) # for reproducibility mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10) mat[ mat > 45 ] <- NA # create some missing values mat[ ,9 ] <- NA # station 9's data is completely missing

Here are two example of such wrappers :

find.stats1 <- function( data, threshold=c(37,39,41) ){   

  n <- length(threshold)
  out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise

  out[1, ] <- apply(data, 2, function(x)

                         ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))

  for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )   

  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )

}   

find.stats2 <- function( data, threshold=c(37,39,41) ){   

  n      <- length(threshold)
  excess <- numeric( n )
  out    <- matrix(  nrow=(n + 1), ncol=ncol(data) ) # initialise
  good   <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
  # colums that are not completely missing  

  out[ , good] <- apply( data[ , good], 2, function(x){     m <- max( x, na.rm=T )
    for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }     return( c(m, excess) )
  } )   

  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )

}

find.stats1( mat )

          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1    0     1
above_39     2    1    0    1    3    2    1    1    0     1
above_41     2    1    0    0    2    2    1    1    0     1

find.stats2( mat )
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1   NA     1
above_39     2    1    0    1    3    2    1    1   NA     1
above_41     2    1    0    0    2    2    1    1   NA     1


On my laptop 'find.stats1' and 'find.stats2' (which is more flexible) takes 7 and 6 seconds respectively to execute on a dataset with 10000 stations and 365 days.

Regards, Adai

On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:
> Dear all,
>
> Dimitris and Andy, thanks for your great help. I have progressed to the
> following code which runs very fast and effective:
>
> mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
> mat[mat>45] <- NA
> mat<-NA
> mat
> temps <- c(35, 37, 39)
> ind <- rbind(

> t(sapply(temps, function(temp)
> rowSums(mat > temp, na.rm=TRUE) )),
> rowSums(!is.na(mat), na.rm=FALSE),
> apply(mat, 1, max, na.rm=TRUE))
> ind <- t(ind)
> ind
>
> However, some weather stations have missing values for the whole year.
> Unfortunately, the code breaks down (when uncommenting mat<-NA).
>
> I have tried 'ifelse' statements in the functions, but it becomes even
> more of a mess. I could subset the matrix before hand, but this would
> mean merging with a complete matrix afterwards to make it compatible
> with other years. That would slow things down.
>
> How can I make the code robust for rows containing all missing values?
>
> Thanks for your help,
>
> Sander.
>
> Dimitris Rizopoulos wrote:
> > for the maximum you could use something like:
> >
> > ind[, 1] <- apply(mat, 2, max)
> >
> > I hope it helps.
> >
> > Best,
> > Dimitris
> >
> > ----
> > Dimitris Rizopoulos
> > Ph.D. Student
> > Biostatistical Centre
> > School of Public Health
> > Catholic University of Leuven
> >
> > Address: Kapucijnenvoer 35, Leuven, Belgium
> > Tel: +32/16/336899
> > Fax: +32/16/337015
> > Web: http://www.med.kuleuven.ac.be/biostat/
> > http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> >
> >
> >
> > ----- Original Message -----
> > From: "Sander Oom" <slist@oomvanlieshout.net>
> > To: "Dimitris Rizopoulos" <dimitris.rizopoulos@med.kuleuven.be>
> > Cc: <r-help@stat.math.ethz.ch>
> > Sent: Friday, June 10, 2005 12:10 PM
> > Subject: Re: [R] Replacing for loop with tapply!?
> >
> >
> >>Thanks Dimitris,
> >>
> >>Very impressive! Much faster than before.
> >>
> >>Thanks to new found R.basic, I can simply rotate the result with
> >>rotate270{R.basic}:
> >>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>#ind <- matrix(0, length(temps), ncol(mat))
> >>>ind <- matrix(0, 4, ncol(mat))
> >>>(startDate <- date())
> >>[1] "Fri Jun 10 12:08:01 2005"
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind[4, ] <- colMeans(max(mat))
> >>Error in colMeans(max(mat)) : 'x' must be an array of at least two
> >>dimensions
> >>>(endDate <- date())
> >>[1] "Fri Jun 10 12:08:02 2005"
> >>>ind <- rotate270(ind)
> >>>ind[1:10,]
> >> V4 V3 V2 V1
> >>1 0 56 75 80
> >>2 0 46 53 60
> >>3 0 50 58 67
> >>4 0 60 72 80
> >>5 0 59 68 76
> >>6 0 55 67 74
> >>7 0 62 77 93
> >>8 0 45 57 67
> >>9 0 57 68 75
> >>10 0 61 66 76
> >>
> >>However, I have not managed to get the row maximum using your
> >>method? It
> >>should be 50 for most rows, but my first guess code gives an error!
> >>
> >>Any suggestions?
> >>
> >>Sander
> >>
> >>
> >>
> >>Dimitris Rizopoulos wrote:
> >>>maybe you are looking for something along these lines:
> >>>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>ind <- matrix(0, length(temps), ncol(mat))
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind
> >>>
> >>>
> >>>I hope it helps.
> >>>
> >>>Best,
> >>>Dimitris
> >>>
> >>>----
> >>>Dimitris Rizopoulos
> >>>Ph.D. Student
> >>>Biostatistical Centre
> >>>School of Public Health
> >>>Catholic University of Leuven
> >>>
> >>>Address: Kapucijnenvoer 35, Leuven, Belgium
> >>>Tel: +32/16/336899
> >>>Fax: +32/16/337015
> >>>Web: http://www.med.kuleuven.ac.be/biostat/
> >>> http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> >>>
> >>>
> >>>----- Original Message -----
> >>>From: "Sander Oom" <slist@oomvanlieshout.net>
> >>>To: <r-help@stat.math.ethz.ch>
> >>>Sent: Friday, June 10, 2005 10:50 AM
> >>>Subject: [R] Replacing for loop with tapply!?
> >>>
> >>>
> >>>>Dear all,
> >>>>
> >>>>We have a large data set with temperature data for weather stations
> >>>>across the globe (15000 stations).
> >>>>
> >>>>For each station, we need to calculate the number of days a certain
> >>>>temperature is exceeded.
> >>>>
> >>>>So far we used the following S code, where mat88 is a matrix
> >>>>containing
> >>>>rows of 365 daily temperatures for each of 15000 weather stations:
> >>>>
> >>>>m <- 37
> >>>>n <- 2
> >>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
> >>>>for(i in 1:nrow(mat88)) {
> >>>># i <- 3
> >>>>row1 <- as.data.frame(df88[i, ])
> >>>>temprow37 <- select.rows(row1, row1 > m)
> >>>>temprow39 <- select.rows(row1, row1 > m + n)
> >>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
> >>>>outmat88[i, 1] <- max(row1, na.rm = T)
> >>>>outmat88[i, 2] <- count.rows(temprow37)
> >>>>outmat88[i, 3] <- count.rows(temprow39)
> >>>>outmat88[i, 4] <- count.rows(temprow41)
> >>>>}
> >>>>outmat88
> >>>>
> >>>>We have transferred the data to a more potent Linux box running R,
> >>>>but
> >>>>still hope to speed up the code.
> >>>>
> >>>>I know a for loop should be avoided when looking for speed. I also
> >>>>know
> >>>>the answer is in something like tapply, but my understanding of
> >>>>these
> >>>>commands is still to limited to see the solution. Could someone
> >>>>show
> >>>>me
> >>>>the way!?
> >>>>
> >>>>Thanks in advance,
> >>>>
> >>>>Sander.
> >>>>--
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Sat Jun 11 11:14:18 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:32:31 EST