Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Kevin B. Hendricks <kevin.hendricks_at_sympatico.ca>
Date: Mon 31 Jul 2006 - 21:19:52 GMT

Hi Thomas,

Here is a comparison of performance times from my own igroupSums versus using split and rowsum:

> x <- rnorm(2e6)
> i <- rep(1:1e6,2)
>
> unix.time(suma <- unlist(lapply(split(x,i),sum)))
[1] 8.188 0.076 8.263 0.000 0.000
>
> names(suma)<- NULL
>
> unix.time(sumb <- igroupSums(x,i))
[1] 0.036 0.000 0.035 0.000 0.000
>
> all.equal(suma, sumb)
[1] TRUE
>
> unix.time(sumc <- rowsum(x,i))
[1] 0.744 0.000 0.742 0.000 0.000
>
> sumc <- sumc[,1]
> names(sumc)<-NULL
> all.equal(suma,sumc)
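[1] TRUE

So my implementation of igroupSums is faster than both alternatives: roughly 0.04 seconds elapsed, against 0.74 seconds for rowsum and 8.26 seconds for split with lapply. I have also implemented igroupMins, igroupMaxs, igroupAnys, igroupAlls, igroupCounts, igroupMeans, and igroupRanges.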

The igroup functions I implemented do not handle weights yet but do handle NAs properly.
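
For anyone who wants to rerun the split and rowsum parts of the comparison, here is a small sketch using only base R. It assumes system.time in place of unix.time (the equivalent call on recent R versions) and uses a smaller problem size so it finishes quickly. The igroupSums call is left out because that function is not part of base R, and the variable names t_split and t_rowsum are mine.

```r
# Sketch: time the two base R approaches on a smaller version of the test data.
set.seed(1)
n_groups <- 1e4
x <- rnorm(2 * n_groups)
i <- rep(seq_len(n_groups), 2)   # every group code appears exactly twice

# split + lapply: build one vector per group, then sum each one
t_split <- system.time(
  suma <- unlist(lapply(split(x, i), sum), use.names = FALSE)
)

# rowsum: base R's grouped sum, which returns a one-column matrix here
t_rowsum <- system.time(sumc <- rowsum(x, i)[, 1])
names(sumc) <- NULL

cat("split/lapply elapsed:", t_split[["elapsed"]], "\n")
cat("rowsum elapsed:      ", t_rowsum[["elapsed"]], "\n")

# Both approaches should agree up to floating point tolerance
stopifnot(isTRUE(all.equal(suma, sumc)))
```

Absolute timings will differ from the figures in this post, since they depend on the machine and the R version.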

Assuming I clean them up, is anyone in the R developer group interested?

Or would you rather I instead extend the rowsum approach to create rowcount, rowmax, rowmin, etc., using a hash function approach?

All of these approaches simply use different ways to map group codes to integers and then compute the functions in the same way.
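
To illustrate that last point, here is a minimal sketch in plain R of the two-step pattern: first map arbitrary group codes to the integers 1..k, then accumulate into a result vector indexed by those integers. The function name group_stats_sketch is hypothetical, and this is not the igroupSums code; it uses match() for the mapping step and assumes the group vector contains no NAs.

```r
# Illustrative sketch only; group_stats_sketch is a hypothetical name,
# not the igroupSums implementation discussed in this thread.
group_stats_sketch <- function(x, g) {
  # Step 1: map group codes to integers 1..k (assumes g has no NA values).
  keys <- sort(unique(g))
  idx  <- match(g, keys)
  k    <- length(keys)

  # Step 2: accumulate per group using the integer index.
  sums <- numeric(k)
  for (j in seq_along(x)) {
    sums[idx[j]] <- sums[idx[j]] + x[j]
  }

  # The same index serves other summaries; counts are a simple tabulation.
  counts <- tabulate(idx, nbins = k)

  list(keys = keys, sums = sums, counts = counts)
}

x <- c(1, 2, 3, 4, 5, 6)
g <- c("b", "a", "b", "c", "a", "b")
res <- group_stats_sketch(x, g)

# Cross-check the sums against base R's rowsum()
stopifnot(isTRUE(all.equal(res$sums, unname(rowsum(x, g)[, 1]))))
```

Only step 1 would change between a sorted-data approach and a hashing approach; step 2 stays the same, and a min, max, or mean could reuse the same idx vector.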

Thanks,

Kevin



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue Aug 01 08:28:47 2006
