Re: [R] Nested functions

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Wed 19 Jul 2006 - 00:55:31 EST

On Mon, 17 Jul 2006, John Wiedenhoeft wrote:

> Hi there,
>
> I'm having myself a hard time writing an algorithm for finding patterns
> within a given melody. In a vector I'd like to find ALL sequences that
> occur at least twice, without having to check all possible patterns via
> pattern matching.
>

Another approach, which works for not-too-long vectors like you have is:

   n <- length(v)
   matches <- outer(v, v, "==") & outer(1:n,1:n,">=")

Now matches has TRUE where v[i]==v[j]. For a longer match you would also need v[i+1]==v[j+1] and so on, making a diagonal line through the matrix. Diagonal lines are hard, so let's turn them into horizontal lines

   matches <- matrix(cbind(matches, FALSE), ncol=n)

now row i+1 column j of matches is TRUE for a single entry match starting at position j at a separation of i. If there is a match of length 2, then column j+1 will also be TRUE, and so on.

Now rle() applied to a row will return the lengths of consecutive sequences of TRUE and FALSE. The lengths of consecutive sequences of TRUE are the lengths of the matches. To get rid of trivial matches of length less than 2 do

   match2 <- t(apply(matches,1,function(row){

                      r<-rle(row)
                      r$values[r$lengths<2]<-FALSE
 	             inverse.rle(r)
                   }))



And finally, to extract the matches

   results <- apply(match2, 1, function(row){

                            r<-rle(row)
                            n<-length(r$lengths)
                            ends<-cumsum(r$lengths)
                            starts<-cumsum(c(1,r$lengths))[1:n]
                            list(starts[r$values],ends[r$values])
                     })

for starts and ends of matches or

   results <- apply(match2, 1, function(row){

                          r<-rle(row)
                          n<-length(r$lengths)
                          ends<-cumsum(r$lengths)[r$values]
                          starts<-cumsum(c(1,r$lengths))[1:n][r$values]
                          mapply(function(stt,end) v[stt:end],starts,ends,
                                  SIMPLIFY=FALSE)
                     })

to get a list of the actual matching sequences.

         -thomas



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed Jul 19 01:13:36 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 19 Jul 2006 - 02:17:00 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.