Re: [Rd] Fastest non-overlapping binning mean function out there?

From: Martin Morgan <mtmorgan_at_fhcrc.org>
Date: Wed, 03 Oct 2012 13:50:07 -0700

On 10/3/2012 6:47 AM, Martin Morgan wrote:
> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>> Hi,
>>
>> I'm looking for a super-duper fast mean/sum binning implementation
>> available in R, and before implementing z = binnedMeans(x y) in native
>> code myself, does any one know of an existing function/package for
>> this? I'm sure it already exists. So, given data (x,y) and B bins
>> bx < bx < ... < bx[B] < bx[B+1], I'd like to calculate the
>> binned means (or sums) 'z' such that z = mean(x[bx <= x & x <
>> bx]), z = mean(x[bx <= x & x < bx]), .... z[B]. Let's
>> assume there are no missing values and 'x' and 'bx' is already
>> ordered. The length of 'x' is in the order of 10,000-millions. The
>> number of elements in each bin vary.
>
> since x and bx are ordered (sorting x would be expensive), the C code seems
> pretty straight-forward and memory-efficient -- create a result vector as long
> as bx, then iterate through x accumulating n and it's sum until x[i] >= bx[i].
> (I think R's implementation of mean() actually pays more attention to numerical
> error, but with this much data...)
>
> library(inline)
> binmean <- cfunction(signature(x="numeric", bx="numeric"),
> " int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;

I'll take my solution back. The problem specification says that x has 10,000-millions of elements, so we need to use R-devel and

R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;

but there are two further issues. The first is that on my system

p\$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

I have __SIZEOF_SIZE_T__ 8 but

(a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using the inline package, to avoid that level of uncertainty) and then

> SEXP ans = PROTECT(NEW_NUMERIC(nb));
> double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
> sum = j = n = 0;
> for (i = 0; i < nx; ++i) {

(b) because nx is a signed type, and since nx > .Machine\$integer.max is represented as a negative number, I don't ever iterate this loop. So I'd have to be more clever if I wanted this to work on all platforms.

For what it's worth, Herve's solution is also problematic

> xx = findInterval(bx, x)
Error in findInterval(bx, x) : long vector 'vec' is not supported

A different strategy for the problem at hand would seem to involve iteration over sequences of x, collecting sufficient statistics (n, sum) for each iteration, and calculating the mean at the end of the day. This might also result in better memory use and allow parallel processing.

Martin

> while (xp[i] >= bxp[j]) {
> ansp[j++] = n > 0 ? sum / n : 0;
> sum = n = 0;
> }
> n += 1;
> sum += xp[i];
> }
> ansp[j] = n > 0 ? sum / n : 0;
> UNPROTECT(1);
> return ans;
> ")
>
> with a test case
>
> nx <- 4e7
> nb <- 1e3
> x <- sort(runif(nx))
> bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
>
>
> > bx1 <- c(bx[-1], bx[nb] + 1)
> > system.time(res <- binmean(x, bx1))
> user system elapsed
> 0.052 0.000 0.050
>
> Martin
>
>>
>> Thanks,
>>
>> Henrik
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>

```--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
```
Received on Wed 03 Oct 2012 - 20:56:09 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 04 Oct 2012 - 21:10:45 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.