From: Martin Morgan <mtmorgan_at_fhcrc.org>

Date: Wed, 03 Oct 2012 06:47:11 -0700

")

Date: Wed, 03 Oct 2012 06:47:11 -0700

On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:

*> Hi,
**>
*

> I'm looking for a super-duper fast mean/sum binning implementation

*> available in R, and before implementing z = binnedMeans(x y) in native
**> code myself, does any one know of an existing function/package for
**> this? I'm sure it already exists. So, given data (x,y) and B bins
**> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
**> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
**> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), .... z[B]. Let's
**> assume there are no missing values and 'x' and 'bx' is already
**> ordered. The length of 'x' is in the order of 10,000-millions. The
**> number of elements in each bin vary.
*

since x and bx are ordered (sorting x would be expensive), the C code seems pretty straight-forward and memory-efficient -- create a result vector as long as bx, then iterate through x accumulating n and it's sum until x[i] >= bx[i]. (I think R's implementation of mean() actually pays more attention to numerical error, but with this much data...)

library(inline)

binmean <- cfunction(signature(x="numeric", bx="numeric"),
" int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;

SEXP ans = PROTECT(NEW_NUMERIC(nb)); double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans); sum = j = n = 0; for (i = 0; i < nx; ++i) { while (xp[i] >= bxp[j]) { ansp[j++] = n > 0 ? sum / n : 0; sum = n = 0; } n += 1; sum += xp[i]; } ansp[j] = n > 0 ? sum / n : 0; UNPROTECT(1); return ans;

")

with a test case

nx <- 4e7

nb <- 1e3

x <- sort(runif(nx))

bx <- do.call(seq, c(as.list(range(x)), length.out=nb))

leading to

> bx1 <- c(bx[-1], bx[nb] + 1)

> system.time(res <- binmean(x, bx1))

user system elapsed

0.052 0.000 0.050

Martin

*>
*

> Thanks,

*>
**> Henrik
**>
**> ______________________________________________
**> R-devel_at_r-project.org mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-devel
**>
*

-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 ______________________________________________ R-devel_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-develReceived on Wed 03 Oct 2012 - 15:27:55 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

*
Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.
Archive generated by hypermail 2.2.0, at Wed 03 Oct 2012 - 22:10:45 GMT.
*

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel.
Please read the posting
guide before posting to the list.
*