From: Henrik Bengtsson <hb_at_biostat.ucsf.edu>

Date: Thu, 04 Oct 2012 10:45:30 -1000

}

sum += yp[i];

n += 1;

* }
*

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 04 Oct 2012 - 20:48:01 GMT

Date: Thu, 04 Oct 2012 10:45:30 -1000

Thank you all - I appreciate all your suggestions. My post was mainly to check if there's already an existing and fast implementation out there. I've ended up adopting Martin's first proposal. Something like this:

#include <Rdefines.h> #include <R.h> #include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
SEXP ans = PROTECT(NEW_NUMERIC(nb));

double *ansp = REAL(ans);

int retcount = LOGICAL(retCount)[0];

SEXP count = NULL;

int *countp = NULL;

int i = 0, j = 0, n = 0, iStart=0;

double sum = 0.0;

// Assert same lengths of 'x' and 'y'

if (ny != nx) {

error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx);

* }
*

if (retcount) {

count = PROTECT(NEW_INTEGER(nb));

countp = INTEGER(count);

* }
*

// Skip to the first bin

while (xp[iStart] < bxp[0]) {

++iStart;

* }
*

// For each x...

for (i = iStart; i < nx; ++i) {

// Skip to a new bin?

while (xp[i] >= bxp[j+1]) {

// Update statistic of current bin? if (retcount) { countp[j] = n; } ansp[j] = n > 0 ? sum / n : 0; sum = 0.0; n = 0; // ...and move to next ++j;

}

sum += yp[i];

n += 1;

// Update statistic of last bin

ansp[j] = n > 0 ? sum / n : 0;

if (retcount) {

countp[j] = n;

setAttrib(ans, install("count"), count);
UNPROTECT(1); // 'count'

* }
*

UNPROTECT(1); // 'ans'

return ans;

} // binMeans()

BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6) and not a scalar (10^10), but the misinterpretation got Martin to point out some of the exciting work extending R core to support "very large" integers. And, as Herve pointed out, I forgot to say that the number of bins could be of that range as well, or maybe an order of magnitude less (say ~1-100 data points per bin).

/Henrik

On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmorgan_at_fhcrc.org> wrote:

> On 10/3/2012 6:47 AM, Martin Morgan wrote: >> >> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote: >>> >>> Hi, >>> >>> I'm looking for a super-duper fast mean/sum binning implementation >>> available in R, and before implementing z = binnedMeans(x y) in native >>> code myself, does any one know of an existing function/package for >>> this? I'm sure it already exists. So, given data (x,y) and B bins >>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the >>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x < >>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), .... z[B]. Let's >>> assume there are no missing values and 'x' and 'bx' is already >>> ordered. The length of 'x' is in the order of 10,000-millions. The >>> number of elements in each bin vary. >> >> >> since x and bx are ordered (sorting x would be expensive), the C code >> seems >> pretty straight-forward and memory-efficient -- create a result vector as >> long >> as bx, then iterate through x accumulating n and it's sum until x[i] >= >> bx[i]. >> (I think R's implementation of mean() actually pays more attention to >> numerical >> error, but with this much data...) >> >> library(inline) >> binmean <- cfunction(signature(x="numeric", bx="numeric"), >> " int nx = Rf_length(x), nb = Rf_length(bx), i, j, n; > > > I'll take my solution back. The problem specification says that x has > 10,000-millions of elements, so we need to use R-devel and > > R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n; > > but there are two further issues. The first is that on my system > > p$ gcc --version > gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 > > I have __SIZEOF_SIZE_T__ 8 but > > (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I > end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using > the inline package, to avoid that level of uncertainty) and then > > >> SEXP ans = PROTECT(NEW_NUMERIC(nb)); >> double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans); >> sum = j = n = 0; >> for (i = 0; i < nx; ++i) { > > > (b) because nx is a signed type, and since nx > .Machine$integer.max is > represented as a negative number, I don't ever iterate this loop. So I'd > have to be more clever if I wanted this to work on all platforms. > > For what it's worth, Herve's solution is also problematic > >> xx = findInterval(bx, x) > Error in findInterval(bx, x) : long vector 'vec' is not supported > > A different strategy for the problem at hand would seem to involve iteration > over sequences of x, collecting sufficient statistics (n, sum) for each > iteration, and calculating the mean at the end of the day. This might also > result in better memory use and allow parallel processing. > > Martin > > >> while (xp[i] >= bxp[j]) { >> ansp[j++] = n > 0 ? sum / n : 0; >> sum = n = 0; >> } >> n += 1; >> sum += xp[i]; >> } >> ansp[j] = n > 0 ? sum / n : 0; >> UNPROTECT(1); >> return ans; >> ") >> >> with a test case >> >> nx <- 4e7 >> nb <- 1e3 >> x <- sort(runif(nx)) >> bx <- do.call(seq, c(as.list(range(x)), length.out=nb)) >> >> leading to >> >> > bx1 <- c(bx[-1], bx[nb] + 1) >> > system.time(res <- binmean(x, bx1)) >> user system elapsed >> 0.052 0.000 0.050 >> >> Martin >> >>> >>> Thanks, >>> >>> Henrik >>> >>> ______________________________________________ >>> R-devel_at_r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> >> > > > -- > Dr. Martin Morgan, PhD > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 ______________________________________________R-devel_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 04 Oct 2012 - 20:48:01 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

*
Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.
Archive generated by hypermail 2.2.0, at Thu 04 Oct 2012 - 21:30:45 GMT.
*

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel.
Please read the posting
guide before posting to the list.
*