Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

From: Greg Snow <Greg.Snow_at_intermountainmail.org>
Date: Wed 26 Jul 2006 - 02:56:37 EST


Regular expression matching may be overkill for this case (the RE engine will do a lot of backtracking through a lot of non-matches). Here is an alternative that splits the text into a vector of words, extracts the last 2 letters of each word (if the last 3 letters match, then the last 2 must match too, so we only need to consider the last 2), compares all pairs of words for matching endings, and then pastes everything back together with the matches marked:

text<-"And this is a second rand sentence"

tmp1 <- strsplit(text, ' ')[[1]]
tmp2 <- nchar(tmp1)
tmp3 <- substr(tmp1,tmp2-1,tmp2)

tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

tmp6 <- rep('', length(tmp1))
count <- 1
for( i in which(tmp5) ){
        tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ], '<r',count,'>',sep='')
        tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ], '<r',count,'>',sep='')
        count <- count + 1
}

out.text <- paste( tmp1,tmp6, sep='',collapse=' ')
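
For reuse, the steps above could be wrapped in a function along the following lines. This is only a sketch: the name mark_rhymes and its argument n are made up for illustration, and it adds three assumptions that are not in the code above (comparison is case-insensitive, trailing punctuation is ignored when comparing, and words shorter than n characters never match).

```r
# Hypothetical helper; the name and the argument n are illustrative only.
mark_rhymes <- function(text, n = 2) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]

  # Assumption: compare lower-cased words with trailing punctuation removed
  bare <- tolower(sub("[^[:alnum:]]+$", "", words))
  len  <- nchar(bare)

  # Assumption: words shorter than n characters get NA, so they never match
  ends <- ifelse(len >= n, substr(bare, len - n + 1, len), NA_character_)

  # Every pair of word positions (row > col), as in tmp4 above
  pairs <- which(lower.tri(diag(length(words))), arr.ind = TRUE)

  # which() drops the NA comparisons, leaving only true matches
  same <- which(ends[pairs[, 1]] == ends[pairs[, 2]])

  tags  <- rep("", length(words))
  count <- 1
  for (k in same) {
    # Append the same numbered tag to both words of the pair
    tags[pairs[k, ]] <- paste0(tags[pairs[k, ]], "<r", count, ">")
    count <- count + 1
  }

  # Tags go after the whole word, including any trailing punctuation
  paste0(words, tags, collapse = " ")
}

mark_rhymes("And this is a second rand sentence")
# [1] "And<r1><r2> this<r3> is<r3> a second<r1><r4> rand<r2><r4> sentence"
```

Each matching pair gets its own number, so a word that takes part in several pairs carries several tags. The number of pairs grows quadratically with the number of words, which is fine for sentences but may be slow for very long strings.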

If you are doing a lot of text processing like this, I would suggest doing it in Perl rather than R. S Poetry by Dr. Burns has a function to take a vector of character strings in R and run a Perl script on it and return the results.

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@intermountainmail.org

(801) 408-8111
-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Stefan Th. Gries
Sent: Saturday, July 22, 2006 7:49 PM
To: r-help@stat.math.ethz.ch
Subject: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

Dear all

I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.

platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)

I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:

INPUT: This is my dog.
DESIRED OUTPUT: This<r> is<r> my dog.

I found a solution for cases where the potentially rhyming words are adjacent:

text<-"This is my dog."
gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.

(i) While I know what to do for non-adjacent words in general

gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)

this runs into problems with overlapping matches:

text<-"And this is the second sentence"
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
[1] "And<r> this is the second<r> sentence"

It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?

(ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string

text<-"this is an example sentence."
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}

text<-"this is an example sentence."
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

because, as I understand it, this requires the 2+ cases of \\w to be identical characters:

text<-"doo yoo see mee?"
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

Again, any ideas? I'd really appreciate any snippets of codes, pointers, etc.

Thanks so much,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed Jul 26 03:07:27 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.