Re: [Rd] Fastest non-overlapping binning mean function out there?

From: Henrik Bengtsson <hb_at_biostat.ucsf.edu>
Date: Thu, 04 Oct 2012 10:45:30 -1000

Thank you all - I appreciate all your suggestions. My post was mainly to check if there's already an existing and fast implementation out there. I've ended up adopting Martin's first proposal. Something like this:

```#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

```

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {   int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;   double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);   SEXP ans = PROTECT(NEW_NUMERIC(nb));
double *ansp = REAL(ans);
int retcount = LOGICAL(retCount);
SEXP count = NULL;
int *countp = NULL;
int i = 0, j = 0, n = 0, iStart=0;
double sum = 0.0;

// Assert same lengths of 'x' and 'y'
if (ny != nx) {
error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx);
}

if (retcount) {
count = PROTECT(NEW_INTEGER(nb));
countp = INTEGER(count);
}

while (xp[iStart] < bxp) {
++iStart;
}

// For each x...
for (i = iStart; i < nx; ++i) {
while (xp[i] >= bxp[j+1]) {

```      // Update statistic of current bin?
if (retcount) { countp[j] = n; }
ansp[j] = n > 0 ? sum / n : 0;
sum = 0.0;
n = 0;
// ...and move to next
++j;
```

}
sum += yp[i];
n += 1;
}

// Update statistic of last bin
ansp[j] = n > 0 ? sum / n : 0;
if (retcount) {
countp[j] = n;
setAttrib(ans, install("count"), count);     UNPROTECT(1); // 'count'
}

UNPROTECT(1); // 'ans'

return ans;
} // binMeans()

BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6) and not a scalar (10^10), but the misinterpretation got Martin to point out some of the exciting work extending R core to support "very large" integers. And, as Herve pointed out, I forgot to say that the number of bins could be of that range as well, or maybe an order of magnitude less (say ~1-100 data points per bin).

/Henrik

On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmorgan_at_fhcrc.org> wrote:

```> On 10/3/2012 6:47 AM, Martin Morgan wrote:
>>
>> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>>>
>>> Hi,
>>>
>>> I'm looking for a super-duper fast mean/sum binning implementation
>>> available in R, and before implementing z = binnedMeans(x y) in native
>>> code myself, does any one know of an existing function/package for
>>> this?  I'm sure it already exists.  So, given data (x,y) and B bins
>>> bx < bx < ... < bx[B] < bx[B+1], I'd like to calculate the
>>> binned means (or sums) 'z' such that z = mean(x[bx <= x & x <
>>> bx]), z = mean(x[bx <= x & x < bx]), .... z[B].  Let's
>>> assume there are no missing values and 'x' and 'bx' is already
>>> ordered.  The length of 'x' is in the order of 10,000-millions.  The
>>> number of elements in each bin vary.
>>
>>
>> since x and bx are ordered (sorting x would be expensive), the C code
>> seems
>> pretty straight-forward and memory-efficient -- create a result vector as
>> long
>> as bx, then iterate through x accumulating n and it's sum until x[i] >=
>> bx[i].
>> (I think R's implementation of mean() actually pays more attention to
>> numerical
>> error, but with this much data...)
>>
>> library(inline)
>> binmean <- cfunction(signature(x="numeric", bx="numeric"),
>> "   int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
>
>
> I'll take my solution back. The problem specification says that x has
> 10,000-millions of elements, so we need to use R-devel and
>
>     R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
>
> but there are two further issues. The first is that on my system
>
> p\$ gcc --version
> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>
> I have __SIZEOF_SIZE_T__ 8 but
>
> (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I
> end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using
> the inline package, to avoid that level of uncertainty) and then
>
>
>>      SEXP ans = PROTECT(NEW_NUMERIC(nb));
>>      double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
>>      sum = j = n = 0;
>>      for (i = 0; i < nx; ++i) {
>
>
> (b) because nx is a signed type, and since nx > .Machine\$integer.max is
> represented as a negative number, I don't ever iterate this loop. So I'd
> have to be more clever if I wanted this to work on all platforms.
>
> For what it's worth, Herve's solution is also problematic
>
>> xx = findInterval(bx, x)
> Error in findInterval(bx, x) : long vector 'vec' is not supported
>
> A different strategy for the problem at hand would seem to involve iteration
> over sequences of x, collecting sufficient statistics (n, sum) for each
> iteration, and calculating the mean at the end of the day. This might also
> result in better memory use and allow parallel processing.
>
> Martin
>
>
>>          while (xp[i] >= bxp[j]) {
>>               ansp[j++] = n > 0 ? sum / n : 0;
>>               sum = n = 0;
>>          }
>>          n += 1;
>>          sum += xp[i];
>>      }
>>      ansp[j] = n > 0 ? sum / n : 0;
>>      UNPROTECT(1);
>>      return ans;
>> ")
>>
>> with a test case
>>
>> nx <- 4e7
>> nb <- 1e3
>> x <- sort(runif(nx))
>> bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
>>
>>
>>  > bx1 <- c(bx[-1], bx[nb] + 1)
>>  > system.time(res <- binmean(x, bx1))
>>     user  system elapsed
>>    0.052   0.000   0.050
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Henrik
>>>
>>> ______________________________________________
>>> R-devel_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109

______________________________________________
```
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 04 Oct 2012 - 20:48:01 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 04 Oct 2012 - 21:30:45 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.