[R] Regular expressions: bug or misunderstanding?

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Sun, 06 Jul 2008 17:17:04 -0400

I'm trying to write a gsub() call that takes a string and escapes all the unescaped quote marks in it. So the string

\"

would be left unchanged, but

\\"

would be changed to

\\\"

because the double backslash doesn't act as an escape for the quote; the first backslash just escapes the second. Regular expressions involving backslashes make everything I write completely unreadable, so I'm going to change the problem for this post: I will define E to be the escape character and q to be the quote. The gsub() call would leave

Eq

unchanged, but would change

EEq

to EEEq, etc.

The expression I have come up with after this change is

gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)

i.e. "(start of line, or non-escape, followed by an even number of escapes), all of which we call expression 1, followed by a quote, is replaced by expression 1 followed by an escape and a quote".

This works sometimes, but not always:

> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
[1] "Eq"
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
[1] "EEEq"
> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
[1] "EqaEq"

> gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
[1] "qEq"

Notice that in the final example, the first quote doesn't get escaped. Why not????

Duncan Murdoch
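
One possible reading of the final example above, offered as a sketch rather than a confirmed diagnosis: gsub() matches never overlap, and in "qq" the alternative [^E] can consume the first q as the "non-escape" character. The match is then the whole string "qq" with expression 1 equal to "q", which gives "qEq". A zero-width lookbehind checks the preceding character without consuming it. The code below assumes perl = TRUE (PCRE) is acceptable; the helper name escape_quotes is hypothetical and does not come from the post.

```r
# Hypothetical helper (the name is not from the original post).
# (?<!E) asserts that the previous character is not an escape, without
# consuming it, so two adjacent quotes can each be matched on their own.
# ((?:EE)*) captures an even-length run of escapes in front of the quote.
escape_quotes <- function(x) {
  gsub("(?<!E)((?:EE)*)q", "\\1Eq", x, perl = TRUE)
}

# Inspect what the original pattern actually matches in "qq"
regmatches("qq", gregexpr("((^|[^E])(EE)*)q", "qq"))

# The four test strings from the post
escape_quotes(c("Eq", "EEq", "qaq", "qq"))
# returns c("Eq", "EEEq", "EqaEq", "EqEq")
```

The lookbehind is the design choice here: because it has zero width, the character before a quote is still available to the next match, which the original pattern cannot guarantee.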

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sun 06 Jul 2008 - 21:26:34 GMT
