Re: [R] Regular expressions: bug or misunderstanding?

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Sun, 06 Jul 2008 20:15:44 -0400

On 06/07/2008 7:37 PM, Gabor Grothendieck wrote:
> Look at the discussion of zero width lookahead assertions in ?regex .
> Use perl = TRUE as previously indicated.

Thanks, this seems to work:

gsub( "(?<!E)((EE)*)q", "\\1Eq", x, perl=TRUE)

Duncan Murdoch
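
For anyone checking this against the examples quoted below, here is a small sketch that runs the lookbehind version over the same four inputs. The helper names escape_q and escape_quotes are made up for this illustration, and the backslash/double-quote pattern is an assumed translation of the E/q version back to real characters, not something posted in the thread.

```r
# Hypothetical helper (the name is an assumption): escape every q that is
# preceded by an even number of E's (including zero).
escape_q <- function(x) {
  gsub("(?<!E)((EE)*)q", "\\1Eq", x, perl = TRUE)
}

inputs <- c("Eq", "EEq", "qaq", "qq")
results <- escape_q(inputs)
print(results)
# [1] "Eq"    "EEEq"  "EqaEq" "EqEq"

# Assumed translation to real characters: E becomes a backslash and q
# becomes a double quote. Each backslash is doubled once for the regex
# and once more for the R string literal.
escape_quotes <- function(x) {
  gsub('(?<!\\\\)((\\\\\\\\)*)"', '\\1\\\\"', x, perl = TRUE)
}

# writeLines() shows the raw characters without R's own escaping.
real_results <- escape_quotes(c('\\"', '\\\\"', '""'))
writeLines(real_results)
```

The lookbehind (?<!E) checks the character before the match without consuming it, so the q in "qq" that the earlier pattern swallowed as a "non-escape" is still available to be matched as a quote.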

> 
> On Sun, Jul 6, 2008 at 7:29 PM, Duncan Murdoch <murdoch_at_stats.uwo.ca> wrote:
>> On 06/07/2008 5:37 PM, (Ted Harding) wrote:
>>> On 06-Jul-08 21:17:04, Duncan Murdoch wrote:

>>>> I'm trying to write a gsub() call that takes a string and escapes all the
>>>> unescaped quote marks in it. So the string
>>>>
>>>> \"
>>>>
>>>> would be left unchanged, but
>>>>
>>>> \\"
>>>>
>>>> would be changed to
>>>>
>>>> \\\"
>>>>
>>>> because the double backslash doesn't act as an escape for the quote,
>>>> the first just escapes the second. I have the usual problems of
>>>> writing regular expressions involving backslashes which make
>>>> everything I write completely unreadable, so I'm going to change
>>>> the problem for this post: I will define E to be the escape
>>>> character, and q to be the quote; the gsub() call would leave
>>>>
>>>> Eq
>>>>
>>>> unchanged, but would change
>>>>
>>>> EEq
>>>>
>>>> to EEEq, etc.
>>>>
>>>> The expression I have come up with after this change is
>>>>
>>>> gsub( "((^|[^E])(EE)*)q", "\\1Eq", x)
>>>>
>>>> i.e. "(start of line, or non-escape, followed by an even number of
>>>> escapes), all of which we call expression 1, followed by a quote,
>>>> is replaced by expression 1 followed by an escape and a quote".
>>>>
>>>> This works sometimes, but not always:
>>>>
>>>> > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "Eq")
>>>> [1] "Eq"
>>>> > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "EEq")
>>>> [1] "EEEq"
>>>> > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qaq")
>>>> [1] "EqaEq"
>>>> > gsub( "((^|[^E])(EE)*)q", "\\1Eq", "qq")
>>>> [1] "qEq"
>>>>
>>>> Notice that in the final example, the first quote doesn't get escaped.
>>>> Why not????
>>> I think (without having done the "experimental diagnostics")
>>> that it's because in "qq" the first q matches (^|[^E]) because
>>> it matches [^E] (i.e. is a "non-escape"); since it is followed
>>> by q, it is the second q which gets the escape. Possibly you
>>> need to include "^q" as an additional alternative match at the
>>> start of the line.
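
A rough way to run the "experimental diagnostics" mentioned above is to list the matches that the original pattern actually finds in "qq". This is only a sketch: the variable names are illustrative, and the exact span reported may depend on the regex engine in use. What it shows is that there are two quotes but only one match, because matches cannot overlap and a q consumed by [^E] cannot also be the q that gets escaped.

```r
# Original pattern from the thread (default regex engine, no lookbehind).
pattern <- "((^|[^E])(EE)*)q"
x <- "qq"

# Locate every non-overlapping match in x and extract the matched text.
m <- gregexpr(pattern, x)
matched <- regmatches(x, m)[[1]]

# Count the quotes in the input and the matches actually found.
n_quotes <- nchar(gsub("[^q]", "", x))
n_matches <- length(matched)

print(matched)
cat("quotes:", n_quotes, " matches:", n_matches, "\n")
```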
>> Thanks, that sounds right, but now I can't see how to fix it.  Is there
>> syntax to say:  match A only if it follows B, but don't match the B part?
>>
>> Duncan Murdoch
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

Received on Mon 07 Jul 2008 - 00:20:47 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 07 Jul 2008 - 02:32:14 GMT.
