Re: [Rd] Fastest non-overlapping binning mean function out there?

From: Hervé Pagès <hpages_at_fhcrc.org>
Date: Tue, 02 Oct 2012 19:46:18 -0700

On 10/02/2012 06:11 PM, Hervé Pagès wrote:
> Hi Henrik,
>
> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>> Hi,
>>
>> I'm looking for a super-duper fast mean/sum binning implementation
>> available in R, and before implementing z = binnedMeans(x y) in native
>> code myself, does any one know of an existing function/package for
>> this? I'm sure it already exists. So, given data (x,y) and B bins
>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), .... z[B]. Let's
>> assume there are no missing values and 'x' and 'bx' is already
>> ordered. The length of 'x' is in the order of 10,000-millions. The
>> number of elements in each bin vary.
>
> You didn't say if you have a lot of bins or not. If you don't have a lot
> of bins (e.g. < 10000), something like
>
> aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
>
> might not be too bad:
>
> > x <- seq(0, 8, by=0.1)
> > bx <- c(2, 2.5, 4, 5.8)
> > aggregate(x, by=list(bin=findInterval(x, bx)), FUN=mean)
> bin x
> 1 0 0.95
> 2 1 2.20
> 3 2 3.20
> 4 3 4.85
> 5 4 6.90

Of course, if you have a lot of bins, using aggregate() is not optimal. But you can replace it by your own optimized version e.g.:

   ## 'bin' must be a sorted vector of non-negative integers of the    ## same length as 'x'.
   fast_aggregate_mean <- function(x, bin, nbins)    {

     bin_count <- tabulate(bin + 1L, nbins=nbins)
     diff(c(0L, cumsum(x)[cumsum(bin_count)])) / bin_count
   }

Then:

   bin <- findInterval(x, bx)
   fast_aggregate_mean(x, bin, nbins=length(bx)+1L)

On my machine this is 100x faster or more than using aggregate() when the number of bins is > 100k. Memory usage is also reduced a lot. Another benefit of using fast_aggregate_mean() over aggregate() is that all the bins are represented in the output (aggregate() ignores empty bins).

Cheers,
H.

>
> I didn't try it on a 10,000-millions-elements vector though (and I've
> no idea how I could do this).
>
> H.
>
>
>>
>> Thanks,
>>
>> Henrik
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages_at_fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 03 Oct 2012 - 03:36:00 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 03 Oct 2012 - 13:00:44 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive