Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Wed 26 Jul 2006 - 11:05:32 EST

Here is yet another solution. This one consists only of two gsubs and a function to reverse a string. It runs at about the same speed as f3 but its main advantage is how compact it is.

pat could be the same as before; however, following Greg's discussion we use \\w\\w to take advantage of his speedup idea. If single-letter endings are OK, use \\w instead of \\w\\w. This time the first gsub simply appends <r> to the first occurrence of any duplicated ending. Then we reverse the string, which turns each <r> into >r< and each ending into a word start. In the second gsub we look for any sequence at the start of a word for which >r< followed by that sequence is found later in the string, and prepend >r< to it. Finally we reverse the result.

text <- "And this is the second sentence"
strrev <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

pat <- "(\\w\\w)(?=\\b.+\\1\\b)"
out <- strrev(gsub(pat, "\\1\\<r>", text, perl = TRUE))
strrev(gsub("\\b(\\w+)(?=.*>r<\\1)", ">r<\\1", out, perl = TRUE))
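
To make the four steps easier to follow, here is a minimal sketch that wraps them in one function and names each intermediate result. The wrapper name mark_rhymes and the step1..step3 variables are assumptions for illustration only; strrev and pat are as above. It has only been traced on the two example sentences from this thread, so treat it as a sketch rather than a general solution.

```r
# strrev is the helper from the post: reverse the characters of one string
strrev <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

# mark_rhymes is a hypothetical wrapper name, not part of the original post
mark_rhymes <- function(text, pat = "(\\w\\w)(?=\\b.+\\1\\b)") {
  # step 1: append <r> to the first occurrence of each repeated ending
  step1 <- gsub(pat, "\\1<r>", text, perl = TRUE)
  # step 2: reverse, so endings become word starts and <r> reads >r<
  step2 <- strrev(step1)
  # step 3: prepend >r< to word starts that reappear later right after >r<
  step3 <- gsub("\\b(\\w+)(?=.*>r<\\1)", ">r<\\1", step2, perl = TRUE)
  # step 4: reverse back to the original reading direction
  strrev(step3)
}

mark_rhymes("And this is the second sentence")
# [1] "And<r> this<r> is<r> the second<r> sentence"
mark_rhymes("this is an example sentence.")
# [1] "this<r> is<r> an example sentence."
```

With the two-character pat, the single "e" shared by "example" and "sentence" is left unmarked, which is the behaviour asked for in (ii).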

On 7/23/06, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
> The following requires more than just a single gsub but it does solve
> the problem. Modify to suit.
>
> The first gsub places <...> around the first occurrence of any
> duplicated suffixes. We use the (?=...) zero width regexp
> to circumvent the nesting problem.
>
> Then we use strapply from the gsubfn package to extract
> the suffixes so marked and paste them together to pass
> to a second gsub which locates them in the original
> string appending an <r> to each. Uncomment the commented
> pat if you only want to match 2+ character suffixes.
>
> library(gsubfn)
> # places <...> around first occurrences of repeated suffixes
> text <- "And this is the second sentence"
> pat <- "(\\w+)(?=\\b.+\\1\\b)"
> # pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
> out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)
>
> suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
> gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
>
>
> On 7/22/06, Stefan Th. Gries <stgries_lists@arcor.de> wrote:
> > Dear all
> >
> > I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and I have two related regular expression problems.
> >
> > platform i386-pc-mingw32
> > arch i386
> > os mingw32
> > system i386, mingw32
> > status
> > major 2
> > minor 3.1
> > year 2006
> > month 06
> > day 01
> > svn rev 38247
> > language R
> > version.string Version 2.3.1 (2006-06-01)
> >
> >
> > I would like to find cases of words in elements of character vectors that end in the same character sequences; if I find such cases, I want to add <r> to both potentially rhyming sequences. An example:
> >
> > INPUT: This is my dog.
> > DESIRED OUTPUT: This<r> is<r> my dog.
> >
> > I found a solution for cases where the potentially rhyming words are adjacent:
> >
> > text<-"This is my dog."
> > gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> >
> > However, with another text vector, I came across two problems I cannot seem to solve and for which I would love to get some input.
> >
> > (i) While I know what to do for non-adjacent words in general
> >
> > gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog", perl=TRUE) # I know this is not proper English ;-)
> >
> > this runs into problems with overlapping matches:
> >
> > text<-"And this is the second sentence"
> > gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> > [1] "And<r> this is the second<r> sentence"
> >
> > It finds the "nd" match, but since the "is" match is within the two "nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?
> >
> > (ii) How would one tell R to match only when there are 2+ characters matching? If the above expression is applied to another character string
> >
> > text<-"this is an example sentence."
> > gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> >
> > it also matches the "e"'s at the end of example and sentence. It's not possible to get rid of that by specifying a range such as {2,}
> >
> > text<-"this is an example sentence."
> > gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> >
> > because, as I understand it, this requires the 2+ cases of \\w to be identical characters:
> >
> > text<-"doo yoo see mee?"
> > gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> >
> > Again, any ideas?
> >
> > I'd really appreciate any snippets of code, pointers, etc.
> > Thanks so much,
> > STG
> > --
> > Stefan Th. Gries
> > -----------------------------------------------
> > University of California, Santa Barbara
> > http://www.linguistics.ucsb.edu/faculty/stgries
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>



Received on Wed Jul 26 12:20:45 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 26 Jul 2006 - 14:27:21 EST.
