Re: [Rd] Any interest in "merge" and "by" implementations specifically for so

From: tshort <tshort_at_eprisolutions.com>
Date: Sun 30 Jul 2006 - 12:02:19 GMT

Kevin, starting with your idea of sorting first, you can get some speedups just using R. Start by comparing the base case that Bill used:

> x <- runif(2e6)
> i <- rep(1:1e6, 2)
> unix.time(res0 <- unlist(lapply(split(x,i), sum)))
[1] 11.00 0.16 11.28 NA NA

Now, try sorting and using a loop:

> idx <- order(i)
> xs <- x[idx]
> is <- i[idx]
> res <- array(NA, 1e6)
> idx <- which(diff(is) > 0)
> startidx <- c(1, idx+1)
> endidx <- c(idx, length(xs))
> f1 <- function(x, startidx, endidx, FUN = sum) {

+   for (j in 1:length(res)) {
+     res[j] <- FUN(x[startidx[j]:endidx[j]])
+   }
+   res
+ }

> unix.time(res1 <- f1(xs, startidx, endidx))
[1] 6.86 0.00 7.04 NA NA

For the case of sum (or averages), you can vectorize this using cumsum as follows. This won't work for median or max.

> f2 <- function(x, startidx, endidx) {

+   cum <- cumsum(x)
+   res <- cum[endidx]
+   res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) -
1)]]
+ res
+ }
> unix.time(res2 <- f2(xs, startidx, endidx))
[1] 0.20 0.00 0.21 NA NA

You can also use Luke Tierney's byte compiler (http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for functions where you can't vectorize:

> library(compiler)
> f3 <- cmpfun(f1)

Note: local functions used: FUN
> unix.time(res3 <- f3(xs, startidx, endidx))
[1] 3.84 0.00 3.91 NA NA

-- 
View this message in context: http://www.nabble.com/Any-interest-in-%22merge%22-and-%22by%22-implementations-specifically-for-sorted-data--tf2009595.html#a5562816
Sent from the R devel forum at Nabble.com.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sun Jul 30 22:07:15 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 31 Jul 2006 - 14:27:48 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.