From: Martin Morgan <mtmorgan_at_fhcrc.org>

Date: Wed, 06 Apr 2011 09:59:25 -0700

}

}

}

}

replicate(3, warmup(y, x))

Date: Wed, 06 Apr 2011 09:59:25 -0700

On 04/04/2011 01:50 PM, William Dunlap wrote:

>> -----Original Message-----

*>> From: r-help-bounces_at_r-project.org
**>> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Stavros Macrakis
**>> Sent: Monday, April 04, 2011 1:15 PM
**>> To: r-help
**>> Subject: [R] General binary search?
**>>
**>> Is there a generic binary search routine in a standard library which
**>>
**>> a) works for character vectors
**>> b) runs in O(log(N)) time?
**>>
**>> I'm aware of findInterval(x,vec), but it is restricted to
**>> numeric vectors.
**>
**> xtfrm(x) will convert a character (or other) vector to
**> a numeric vector with the same ordering. findInterval
**> can work on that. E.g.,
**> > f0<- function(x, vec) {
**> tmp<- xtfrm(c(x, vec))
**> findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
**> }
**> > f0(c("Baby", "Aunt", "Dog"), LETTERS)
**> [1] 2 1 4
**> I've never looked at its speed.
*

For a little progress (though no 'generic binary searchin a standard library'), here's the 'one-liner'

bsearch1 <-

function(val, tab, L=1L, H=length(tab)) {

while (H >= L) { M <- L + (H - L) %/% 2L if (tab[M] > val) H <- M - 1L else if (tab[M] < val) L <- M + 1L else return(M) } return(L - 1L)

}

It seems like a good candidate for the new (R-2.13) 'compiler' package, so

library(compiler)

bsearch2 <- cmpfun(bsearch1)

And Bill's suggestion

bsearch3 <- function(val, tab) {

tmp <- xtfrm(c(val, tab)) findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])}

which will work best when 'val' is a vector to be looked up.

A quick look at data.table:::sortedmatch seemed to return matches, whereas Stavros is looking for lower bounds.

It seems that one could shift the weight more to C code by 'vectorizing' the one-liner, first as

bsearch5 <-

function(val, tab, L=1L, H=length(tab)) {

b <- cbind(L=rep(L, length(val)), H=rep(H, length(val))) i0 <- seq_along(val) repeat { M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L i <- tab[M] > val[i0] b[i0 + i * length(val)] <- ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M)) i0 <- which(b[i0, "H"] >= b[i0, "L"]) if (!length(i0)) break; } b[,"L"] - 1L

}

and then a little more thoughtfully (though more room for improvement) as

bsearch7 <-

function(val, tab, L=1L, H=length(tab)) {

b <- cbind(L=rep(L, length(val)), H=rep(H, length(val))) i0 <- seq_along(val) repeat { updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L tabM <- tab[M] val0 <- val[i0] i <- tabM < val0 updt[i] <- M[i] + 1L i <- tabM > val0 updt[i] <- M[i] - 1L b[i0 + i * length(val)] <- updt i0 <- which(b[i0, "H"] >= b[i0, "L"]) if (!length(i0)) break; } b[,"L"] - 1L

}

none of bsearch 3, 5, or 7 is likely to benefit substantially from compilation.

Here's a little test data set converting numeric to character as an easy cheat.

set.seed(123L)

x <- sort(as.character(rnorm(1e6)))

y <- as.character(rnorm(1e4))

There seems to be some significant initial overhead, so we warm things up (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)

warmup <- function(y, x) {

lapply(y, bsearch1, x) lapply(y, bsearch2, x) bsearch3(y, x) bsearch5(y, x) bsearch7(y, x)

}

replicate(3, warmup(y, x))

and then time

> system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))

user system elapsed

2.692 0.000 2.696

> system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))

user system elapsed

1.379 0.000 1.380

> identical(res1, res2)

**[1] TRUE
**

> system.time(res3 <- bsearch3(y, x)); identical(res1, res3)

user system elapsed

8.339 0.001 8.350

**[1] TRUE
**

> system.time(res5 <- bsearch5(y, x)); identical(res1, res5)

user system elapsed

0.700 0.000 0.702

**[1] TRUE
**

> system.time(res7 <- bsearch7(y, x)); identical(res1, res7)

user system elapsed

0.222 0.000 0.222

**[1] TRUE
**
Martin

*>
*

> Bill Dunlap

*> Spotfire, TIBCO Software
**> wdunlap tibco.com
**>
**>>
**>> I'm also aware of various hashing solutions (e.g.
**>> new.env(hash=TRUE) and
**>> fastmatch), but I need the greatest-lower-bound match in my
**>> application.
**>>
**>> findInterval is also slow for large N=length(vec) because of the O(N)
**>> checking it does, as Duncan Murdoch has pointed
**>> out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
**>> though
**>> its documentation says it runs in O(n * log(N)), it actually
**>> runs in O(n *
**>> log(N) + N), which is quite noticeable for largish N. But
**>> that is easy
**>> enough to work around by writing a variant of findInterval which calls
**>> find_interv_vec without checking.
**>>
**>> -s
**>>
**>> PS Yes, binary search is a one-liner in R, but I always prefer to use
**>> standard, fast native libraries when possible....
**>>
**>> binarysearch<- function(val,tab,L,H) {while (H>=L) {
**>> M=L+(H-L) %/% 2; if
**>> (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
**>> return(L-1)}
**>>
**>> [[alternative HTML version deleted]]
**>>
**>> ______________________________________________
**>> R-help_at_r-project.org mailing list
**>> https://stat.ethz.ch/mailman/listinfo/r-help
**>> PLEASE do read the posting guide
**>> http://www.R-project.org/posting-guide.html
**>> and provide commented, minimal, self-contained, reproducible code.
**>>
**>
**> ______________________________________________
**> R-help_at_r-project.org mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-help
**> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
**> and provide commented, minimal, self-contained, reproducible code.
*

-- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 ______________________________________________ R-help_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.Received on Wed 06 Apr 2011 - 17:04:31 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Thu 07 Apr 2011 - 04:40:28 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*