Re: [R] gsub regexp question

From: Charilaos Skiadas <skiadas_at_hanover.edu>
Date: Sat 27 Jan 2007 - 21:34:39 GMT

On Jan 27, 2007, at 3:41 PM, Phillimore, Albert wrote:

> Dear R Users,
>
> I am trying to use gsub to remove multiple cases of square
> brackets and their different contents in a character string. A
> sample of such a string is shown below. However, I am having great
> difficulty understanding regexp syntax. Any help is greatly
> appreciated.
>
> Ally
>
> "tree STATE_286000 [&lnP=-12708.453945423369] = [&R] ((((((15
> [&rate=0.009761226401396686]:7.040851727747465,17
> [&rate=0.011500289631135564]:7.040851727747465)
> [&rate=0.010986570567484494]:2.257049446900292,(18
> [&rate=0.009123432243563103]:2.461289418776003,19
> [&rate=0.00981822432115329]:2.461289418776003)"

Is this what you want? I tend to prefer perl regular expressions:

 > str <- "tree STATE_286000 [&lnP=-12708.453945423369] = [&R] ((((((15[&rate=0.009761226401396686]:7.040851727747465,17[&rate=0.011500289631135564]:7.040851727747465)[&rate=0.010986570567484494]:2.257049446900292,(18[&rate=0.009123432243563103]:2.461289418776003,19[&rate=0.00981822432115329]:2.461289418776003)"
 > gsub("\\[[^\\]]+\\]", "", str, perl=TRUE)
[1] "tree STATE_286000  =  ((((((15:7.040851727747465,17:7.040851727747465):2.257049446900292,(18:2.461289418776003,19:2.461289418776003)"

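As an explanation, \\[ and \\] match the two literal square brackets you want to remove. The brackets must be escaped with a backslash because they have a special meaning in perl regular expressions, and the backslash is doubled because R itself treats a backslash as an escape character inside a string literal.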
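
To see the escaping on its own, here is a small sketch. The string x is made up for illustration, and the fixed = TRUE variant is an alternative I am adding, not something used in the solution above:

```r
# In R source, "\\[" is a two-character string: a backslash and a "[".
# The regex engine therefore sees \[ , i.e. a literal opening bracket.
pattern_open <- "\\["
print(nchar(pattern_open))

# x is a made-up test string, not data from the original question.
x <- "a[b]c"

# Remove the literal "[" first, then the literal "]".
no_open  <- gsub("\\[", "", x, perl = TRUE)
no_close <- gsub("\\]", "", no_open, perl = TRUE)
print(no_open)    # "ab]c"
print(no_close)   # "abc"

# Alternative: with fixed = TRUE the pattern is a plain string,
# so no escaping is needed at all.
no_open_fixed <- gsub("[", "", x, fixed = TRUE)
print(no_open_fixed)   # "ab]c"
```

The fixed = TRUE form only helps when you want to match one exact piece of text. It cannot express "a bracket, then anything, then a bracket", so the regexp is still needed for the actual problem.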

In perl regexps, "[....]" stands for "match a single character that is one of those listed in the ....". For instance, [ab] will match an a or a b, and [a-z] will match any single lowercase letter. A ^ as the first character in there means "match anything but what follows". For instance, [^a-z] matches any single character except a lowercase letter. So [^\\]] matches any single character except a closing bracket.
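
The character classes described above can be tried one at a time. This is only a sketch on a made-up string x, deleting whatever each class matches so you can see what is left:

```r
# x is a made-up test string containing lowercase letters, a "]" and a digit.
x <- "ab]cd9"

only_ab_removed  <- gsub("[ab]",   "", x, perl = TRUE)  # drops a and b
lower_removed    <- gsub("[a-z]",  "", x, perl = TRUE)  # drops all lowercase letters
nonlower_removed <- gsub("[^a-z]", "", x, perl = TRUE)  # drops everything else
nonclose_removed <- gsub("[^\\]]", "", x, perl = TRUE)  # drops everything but "]"

print(only_ab_removed)   # "]cd9"
print(lower_removed)     # "]9"
print(nonlower_removed)  # "abcd"
print(nonclose_removed)  # "]"
```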

Finally, the plus sign afterwards means "match one or more of the preceding item". So [^\\]]+ means "match any non-empty sequence of characters that does not contain a closing bracket". The whole thing now matches an opening bracket, followed by all characters up to the next closing bracket, followed by that closing bracket. This will not work if you have nested pairs of brackets, [like [so]]. That is a tad more delicate, and we can discuss it if you really need to deal with it.
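
A short sketch of both points, on made-up strings. The helper strip_nested is hypothetical and is just one possible way to handle nesting (repeatedly deleting the innermost pairs), offered as an assumption rather than as the definitive approach:

```r
pattern <- "\\[[^\\]]+\\]"

# The + needs at least one character between the brackets,
# so an empty pair "[]" is left alone.
empty_pair <- gsub(pattern, "", "a[]b", perl = TRUE)
print(empty_pair)    # "a[]b"

# With nested brackets the match stops at the first "]",
# leaving a stray closing bracket behind.
nested <- "x [like [so]] y"
naive <- gsub(pattern, "", nested, perl = TRUE)
print(naive)         # "x ] y"

# Hypothetical helper: delete innermost pairs (no "[" or "]" inside)
# over and over until the string stops changing.
strip_nested <- function(s) {
  innermost <- "\\[[^\\[\\]]*\\]"
  repeat {
    stripped <- gsub(innermost, "", s, perl = TRUE)
    if (identical(stripped, s)) return(s)
    s <- stripped
  }
}

cleaned <- strip_nested(nested)
print(cleaned)       # "x  y" (two spaces remain)
```

Note that the helper uses * rather than +, so it also removes empty pairs. That is needed here because an outer pair can become empty once its inner pair has been deleted.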

Haris



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun Jan 28 08:39:00 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sat 27 Jan 2007 - 22:30:30 GMT.
