From: Thomas Lumley <tlumley_at_u.washington.edu>

Date: Wed 19 Jul 2006 - 00:55:31 EST

for starts and ends of matches or

to get a list of the actual matching sequences.

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed Jul 19 01:13:36 2006

Date: Wed 19 Jul 2006 - 00:55:31 EST

On Mon, 17 Jul 2006, John Wiedenhoeft wrote:

> Hi there,

*>
**> I'm having myself a hard time writing an algorithm for finding patterns
**> within a given melody. In a vector I'd like to find ALL sequences that
**> occur at least twice, without having to check all possible patterns via
**> pattern matching.
**>
*

Another approach, which works for not-too-long vectors like you have is:

n <- length(v)

matches <- outer(v, v, "==") & outer(1:n,1:n,">=")

Now matches has TRUE where v[i]==v[j]. For a longer match you would also need v[i+1]==v[j+1] and so on, making a diagonal line through the matrix. Diagonal lines are hard, so let's turn them into horizontal lines

matches <- matrix(cbind(matches, FALSE), ncol=n)

now row i+1 column j of matches is TRUE for a single entry match starting at position j at a separation of i. If there is a match of length 2, then column j+1 will also be TRUE, and so on.

Now rle() applied to a row will return the lengths of consecutive sequences of TRUE and FALSE. The lengths of consecutive sequences of TRUE are the lengths of the matches. To get rid of trivial matches of length less than 2 do

match2 <- t(apply(matches,1,function(row){

r<-rle(row) r$values[r$lengths<2]<-FALSE inverse.rle(r) }))

And finally, to extract the matches

results <- apply(match2, 1, function(row){

r<-rle(row) n<-length(r$lengths) ends<-cumsum(r$lengths) starts<-cumsum(c(1,r$lengths))[1:n] list(starts[r$values],ends[r$values]) })

for starts and ends of matches or

results <- apply(match2, 1, function(row){

r<-rle(row) n<-length(r$lengths) ends<-cumsum(r$lengths)[r$values] starts<-cumsum(c(1,r$lengths))[1:n][r$values] mapply(function(stt,end) v[stt:end],starts,ends, SIMPLIFY=FALSE) })

to get a list of the actual matching sequences.

-thomas

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed Jul 19 01:13:36 2006

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.1.8, at Wed 19 Jul 2006 - 02:17:00 EST.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*