Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Sun 23 Jul 2006 - 14:05:26 EST

The following requires more than just a single gsub but it does solve the problem. Modify to suit.

The first gsub places <...> around the first occurrence of any duplicated suffix. We use the (?=...) zero-width lookahead so that the text between the two occurrences is not consumed by the match; this circumvents the nesting problem.

Then we use strapply from the gsubfn package to extract the suffixes so marked and paste them together into a pattern for a second gsub, which locates them in the original string and appends an <r> to each. Uncomment the commented pat if you only want to match 2+ character suffixes.

library(gsubfn)
# places <...> around first occurrences of repeated suffixes
text <- "And this is the second sentence"

pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)

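# extract the suffixes marked with <...>
suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
# append <r> to every word-final occurrence of those suffixes in the original text
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)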
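
For comparison, here is a minimal base-R sketch of the same two-step idea, in case gsubfn is not at hand. The function name mark_rhymes and its min_len argument are made up for illustration and are not part of any package. It assumes that the suffixes consist of \w characters only, and that every word ending in a repeated suffix should be tagged, as in the code above.

```r
# Hypothetical helper (not part of gsubfn or base R): tag word-final
# character sequences that recur at the end of a later word.
mark_rhymes <- function(text, min_len = 1) {
  # Step 1: find suffixes of at least min_len characters that occur again
  # at the end of a later word. The lookahead (?=...) is zero-width, so
  # the text up to the second occurrence is not consumed and nested
  # pairs can still be found.
  pat <- sprintf("(\\w{%d,})(?=\\b.+\\1\\b)", min_len)
  suff <- unique(regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]])
  if (length(suff) == 0) return(text)

  # Step 2: build an alternation of the suffixes (longest first, as a
  # precaution) and append <r> to each word-final occurrence.
  suff <- suff[order(-nchar(suff))]
  alt <- paste(suff, collapse = "|")
  gsub(paste0("(", alt, ")\\b"), "\\1<r>", text, perl = TRUE)
}

mark_rhymes("And this is the second sentence")
# [1] "And<r> this<r> is<r> the<r> second<r> sentence<r>"

mark_rhymes("And this is the second sentence", min_len = 2)
# [1] "And<r> this<r> is<r> the second<r> sentence"
```

Setting min_len = 2 plays the same role as uncommenting the second pat above: single-character suffixes such as the final "e" are no longer marked.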

On 7/22/06, Stefan Th. Gries <stgries_lists@arcor.de> wrote:
> Dear all
>
> I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.
>
> platform i386-pc-mingw32
> arch i386
> os mingw32
> system i386, mingw32
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
>
>
> I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:
>
> INPUT: This is my dog.
> DESIRED OUTPUT: This<r> is<r> my dog.
>
> I found a solution for cases where the potentially rhyming words are adjacent:
>
> text<-"This is my dog."
> gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.
>
> (i) While I know what to do for non-adjacent words in general
>
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)
>
> this runs into problems with overlapping matches:
>
> text<-"And this is the second sentence"
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> [1] "And<r> this is the second<r> sentence"
>
> It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?
>
> (ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string
>
> text<-"this is an example sentence."
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}
>
> text<-"this is an example sentence."
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> because, as I understand it, this requires the 2+ cases of \\w to be identical characters:
>
> text<-"doo yoo see mee?"
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> Again, any ideas?
>
> I'd really appreciate any snippets of code, pointers, etc.
> Thanks so much,
> STG
> --
> Stefan Th. Gries
> -----------------------------------------------
> University of California, Santa Barbara
> http://www.linguistics.ucsb.edu/faculty/stgries
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Sun Jul 23 14:12:34 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 26 Jul 2006 - 23:03:41 EST.
