Re: [R] Find number of elements less than some number: Elegant/fastsolution needed

From: Douglas Bates <bates_at_stat.wisc.edu>
Date: Thu, 14 Apr 2011 21:02:37 -0500

On Thu, Apr 14, 2011 at 8:19 PM, Ravi Varadhan <rvaradhan_at_jhmi.edu> wrote:
> Bill's code is insanely fast!
>
> f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
>
>  n1 <- 1e07
> n2 <- 10^c(1,2,3,4,5,6,7)
> tt <- rep(NA, 7)
> x <- rnorm(n1)
> for (i in 1:length(n2)){
> y <- runif(n2[i])
> tt[i] <- system.time(a1 <- f2(x, y))[3]
> }
>
>> tt
> [1]  0.70  0.86  1.03  1.28  1.54  4.99 12.07
>>
>
> I would be surprised if this can be beaten even in C/C++.

Have you looked at the definition of the findInterval function, especially the part where it calls the C function?

I still think that it can be done faster if the x vector is sorted in addition to the y vector, although admittedly the speed gain will not be as much. And, of course, Bill's solution depends on knowing R or S-PLUS well enough to remember that there is a findInterval function.

> From: r-help-bounces@r-project.org [r-help-bounces@r-project.org] On Behalf Of Douglas Bates [bates@stat.wisc.edu]
> Sent: Thursday, April 14, 2011 6:22 PM
> To: William Dunlap
> Cc: r-help_at_r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel
> Subject: Re: [R] Find number of elements less than some number: Elegant/fastsolution needed
>
> My colleague Sunduz Keles once mentioned a similar problem to me.  She
> had a large sample from a reference distribution and a test sample
> (both real-valued in her case) and she wanted, for each element of the
> test sample, the proportion of the reference sample that was less than
> the element.  It's a type of empirical p-value calculation.
>
> I forget the exact sizes of the samples (do you remember, Sunduz?) but
> they were in the tens of thousands or larger.  Solutions in R tended
> to involve comparing every element in the test sample to every element
> in the reference sample but, of course, that is unnecessary.  If you
> sort both samples then you can start the comparisons for a particular
> element of the test sample at the element of the reference sample
> where the last comparison failed.
>
> I was able to write a very short C++ function using the Rcpp package
> that provided about a 1000-fold increase in speed relative to the best
> I could do in R.  I don't have the script on this computer so I will
> post it tomorrow when I am back on the computer at the office.
>
> Apologies for cross-posting to the Rcpp-devel list but I am doing so
> because this might make a good example of the usefulness of Rcpp and
> inline.
>
> On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap_at_tibco.com> wrote:
>>> -----Original Message-----
>>> From: r-help-bounces_at_r-project.org
>>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Kevin Ummel
>>> Sent: Thursday, April 14, 2011 12:35 PM
>>> To: r-help_at_r-project.org
>>> Subject: [R] Find number of elements less than some number:
>>> Elegant/fastsolution needed
>>>
>>> Take vector x and a subset y:
>>>
>>> x=1:10
>>>
>>> y=c(4,5,7,9)
>>>
>>> For each value in 'x', I want to know how many elements in
>>> 'y' are less than 'x'.
>>>
>>> An example would be:
>>>
>>> sapply(x,FUN=function(i) {length(which(y<i))})
>>>  [1] 0 0 0 0 1 2 2 3 3 4
>>>
>>> But this solution is far too slow when x and y have lengths
>>> in the millions.
>>>
>>> I'm certain an elegant (and computationally efficient)
>>> solution exists, but I'm in the weeds at this point.
>>
>> If x is sorted findInterval(x, y) would do it for <= but you
>> want strict <.  Try
>>  f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
>> where your version is
>>  f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
>> E.g.,
>>  > x <- sort(sample(1e6, size=1e5))
>>  > y <- sample(1e6, size=1e4, replace=TRUE)
>>  > system.time(r0 <- f0(x,y))
>>     user  system elapsed
>>    7.900   0.310   8.211
>>  > system.time(r2 <- f2(x,y))
>>     user  system elapsed
>>    0.000   0.000   0.004
>>  > identical(r0, r2)
>>  [1] TRUE
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>>
>>> Any help is much appreciated.
>>>
>>> Kevin
>>>
>>> University of Manchester
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Take two vectors x and y, where y is a subset of x:
>>>
>>> x=1:10
>>>
>>> y=c(2,5,6,9)
>>>
>>> If y is removed from x, the original x values now have a new
>>> placement (index) in the resulting vector (new):
>>>
>>> new=x[-y]
>>>
>>> index=1:length(new)
>>>
>>> The challenge is: How can I *quickly* and *efficiently*
>>> deduce the new 'index' value directly from the original 'x'
>>> value -- using only 'y' as an input?
>>>
>>> In practice, I have very large matrices containing the 'x'
>>> values, and I need to convert them to the corresponding
>>> 'index' if the 'y' values are removed.
>>>
>>>
>>>
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 15 Apr 2011 - 02:07:42 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 15 Apr 2011 - 02:30:31 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive