From: Ravi Varadhan <rvaradhan_at_jhmi.edu>

Date: Thu, 14 Apr 2011 21:19:41 -0400

From: r-help-bounces_at_r-project.org [r-help-bounces_at_r-project.org] On Behalf Of Douglas Bates [bates_at_stat.wisc.edu] Sent: Thursday, April 14, 2011 6:22 PM

To: William Dunlap

Cc: r-help_at_r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel Subject: Re: [R] Find number of elements less than some number: Elegant/fastsolution needed

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 15 Apr 2011 - 06:11:04 GMT

Date: Thu, 14 Apr 2011 21:19:41 -0400

Bill's code is insanely fast!

f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))

n1 <- 1e07

n2 <- 10^c(1,2,3,4,5,6,7)

tt <- rep(NA, 7)

x <- rnorm(n1)

for (i in 1:length(n2)){

y <- runif(n2[i])

tt[i] <- system.time(a1 <- f2(x, y))[3]

}

*> tt
*

[1] 0.70 0.86 1.03 1.28 1.54 4.99 12.07

*>
*

I would be surprised if this can be beaten even in C/C++.

Ravi.

From: r-help-bounces_at_r-project.org [r-help-bounces_at_r-project.org] On Behalf Of Douglas Bates [bates_at_stat.wisc.edu] Sent: Thursday, April 14, 2011 6:22 PM

To: William Dunlap

Cc: r-help_at_r-project.org; Sunduz Keles; rcpp-devel; Kevin Ummel Subject: Re: [R] Find number of elements less than some number: Elegant/fastsolution needed

My colleague Sunduz Keles once mentioned a similar problem to me. She had a large sample from a reference distribution and a test sample (both real-valued in her case) and she wanted, for each element of the test sample, the proportion of the reference sample that was less than the element. It's a type of empirical p-value calculation.

I forget the exact sizes of the samples (do you remember, Sunduz?) but they were in the tens of thousands or larger. Solutions in R tended to involve comparing every element in the test sample to every element in the reference sample but, of course, that is unnecessary. If you sort both samples then you can start the comparisons for a particular element of the test sample at the element of the reference sample where the last comparison failed.

I was able to write a very short C++ function using the Rcpp package that provided about a 1000-fold increase in speed relative to the best I could do in R. I don't have the script on this computer so I will post it tomorrow when I am back on the computer at the office.

On Thu, Apr 14, 2011 at 4:45 PM, William Dunlap <wdunlap_at_tibco.com> wrote:

>> -----Original Message-----

*>> From: r-help-bounces_at_r-project.org
**>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Kevin Ummel
**>> Sent: Thursday, April 14, 2011 12:35 PM
**>> To: r-help_at_r-project.org
**>> Subject: [R] Find number of elements less than some number:
**>> Elegant/fastsolution needed
**>>
**>> Take vector x and a subset y:
**>>
**>> x=1:10
**>>
**>> y=c(4,5,7,9)
**>>
**>> For each value in 'x', I want to know how many elements in
**>> 'y' are less than 'x'.
**>>
**>> An example would be:
**>>
**>> sapply(x,FUN=function(i) {length(which(y<i))})
**>> [1] 0 0 0 0 1 2 2 3 3 4
**>>
**>> But this solution is far too slow when x and y have lengths
**>> in the millions.
**>>
**>> I'm certain an elegant (and computationally efficient)
**>> solution exists, but I'm in the weeds at this point.
**>
**> If x is sorted findInterval(x, y) would do it for <= but you
**> want strict <. Try
**> f2 <- function(x, y) length(y) - findInterval(-x, rev(-sort(y)))
**> where your version is
**> f0 <- function(x, y) sapply(x,FUN=function(i) {length(which(y<i))})
**> E.g.,
**> > x <- sort(sample(1e6, size=1e5))
**> > y <- sample(1e6, size=1e4, replace=TRUE)
**> > system.time(r0 <- f0(x,y))
**> user system elapsed
**> 7.900 0.310 8.211
**> > system.time(r2 <- f2(x,y))
**> user system elapsed
**> 0.000 0.000 0.004
**> > identical(r0, r2)
**> [1] TRUE
**>
**> Bill Dunlap
**> Spotfire, TIBCO Software
**> wdunlap tibco.com
**>
**>>
**>> Any help is much appreciated.
**>>
**>> Kevin
**>>
**>> University of Manchester
**>>
**>>
**>>
**>>
**>>
**>>
**>>
**>>
**>> Take two vectors x and y, where y is a subset of x:
**>>
**>> x=1:10
**>>
**>> y=c(2,5,6,9)
**>>
**>> If y is removed from x, the original x values now have a new
**>> placement (index) in the resulting vector (new):
**>>
**>> new=x[-y]
**>>
**>> index=1:length(new)
**>>
**>> The challenge is: How can I *quickly* and *efficiently*
**>> deduce the new 'index' value directly from the original 'x'
**>> value -- using only 'y' as an input?
**>>
**>> In practice, I have very large matrices containing the 'x'
**>> values, and I need to convert them to the corresponding
**>> 'index' if the 'y' values are removed.
**>>
**>>
**>>
**>>
**>> [[alternative HTML version deleted]]
**>>
**>> ______________________________________________
**>> R-help_at_r-project.org mailing list
**>> https://stat.ethz.ch/mailman/listinfo/r-help
**>> PLEASE do read the posting guide
**>> http://www.R-project.org/posting-guide.html
**>> and provide commented, minimal, self-contained, reproducible code.
**>>
**>
**> ______________________________________________
**> R-help_at_r-project.org mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-help
**> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
**> and provide commented, minimal, self-contained, reproducible code.
**>
*

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 15 Apr 2011 - 06:11:04 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Fri 15 Apr 2011 - 08:20:33 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*