Re: [R] help with regexpr in gsub

From: Marc Schwartz <marc_schwartz_at_comcast.net>
Date: Thu 18 Jan 2007 - 01:10:30 GMT

On Wed, 2007-01-17 at 16:46 -0800, Seth Falcon wrote:
> "Kimpel, Mark William" <mkimpel@iupui.edu> writes:
>
> > I have a very long vector of character strings of the format
> > "GO:0008104.ISS" and need to strip off the dot and anything that follows
> > it. There are always 10 characters before the dot. The actual characters
> > and the number of them after the dot is variable.
> >
> > So, I would like to return in the format "GO:0008104" . I could do this
> > with substr and loop over the entire vector, but I thought there might
> > be a more elegant (and faster) way to do this.
> >
> > I have tried gsub using regular expressions without success. The code
> >
> > gsub(pattern= "\.*?" , replacement="", x=character.vector)
>
> I guess you want:
>
> sub("([GO:0-9]+)\\..*$", "\\1", goids)
>
> [You don't need gsub here]
>
> But I don't understand why you wouldn't want to use substr. At least
> for me substr looks to be about 20x faster than sub for this
> problem...
>
>
> > library(GO)
> > goids = ls(GOTERM)
> > gids = paste(goids, "ISS", sep=".")
> > gids[1:10]
> [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
> [5] "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000009.ISS" "GO:0000010.ISS"
> [9] "GO:0000011.ISS" "GO:0000012.ISS"
>
> > system.time(z <- substr(gids, 0, 10))
> user system elapsed
> 0.008 0.000 0.007
> > system.time(z2 <- sub("([GO:0-9]+)\\..*$", "\\1", gids))
> user system elapsed
> 0.136 0.000 0.134

I think that some of the overhead here in using sub() comes from the regex effectively partitioning each source string into two parts, from the regex itself being more complex, and from then returning just the first part.

This can be shortened to:

# Note that I have 12 elements here
> gids

 [1] "GO:0000001.ISS" "GO:0000002.ISS" "GO:0000003.ISS" "GO:0000004.ISS"
 [5] "GO:0000005.ISS" "GO:0000006.ISS" "GO:0000007.ISS" "GO:0000008.ISS"
 [9] "GO:0000009.ISS" "GO:0000010.ISS" "GO:0000011.ISS" "GO:0000012.ISS"

> system.time(z2 <- sub("\\..+", "", gids))
[1] 0 0 0 0 0

> z2

 [1] "GO:0000001" "GO:0000002" "GO:0000003" "GO:0000004" "GO:0000005"
 [6] "GO:0000006" "GO:0000007" "GO:0000008" "GO:0000009" "GO:0000010"
[11] "GO:0000011" "GO:0000012"


Which would appear to be quick, although with only 12 elements the timings are too small to compare reliably against substr().
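
For a comparison on a longer vector, something along the following lines could be used. This is only a rough sketch: the IDs are synthetic (built with sprintf() rather than taken from the GO package), and the vector length, the suffix list and the repetition count are arbitrary assumptions chosen so that the timings are measurable.

```r
# Synthetic stand-in for the real data: 10-character IDs plus a
# dot and a suffix of varying length (assumed values, not from GO).
n <- 20000
ids <- sprintf("GO:%07d", seq_len(n))
suffixes <- c("ISS", "IEA", "IC", "IMP")
gids <- paste(ids, rep(suffixes, length.out = n), sep = ".")

# The three approaches discussed in this thread.
z_substr  <- substr(gids, 1, 10)
z_capture <- sub("([GO:0-9]+)\\..*$", "\\1", gids)
z_short   <- sub("\\..+", "", gids)

# All three should return the same stripped IDs.
stopifnot(identical(z_substr, z_capture),
          identical(z_substr, z_short))

# Repeat each call so that the elapsed time is not rounded to zero.
reps <- 20
print(system.time(for (i in seq_len(reps)) substr(gids, 1, 10)))
print(system.time(for (i in seq_len(reps)) sub("([GO:0-9]+)\\..*$", "\\1", gids)))
print(system.time(for (i in seq_len(reps)) sub("\\..+", "", gids)))
```

The timings will depend on the machine and the R version, so the relative ordering is best checked locally rather than assumed from the 12-element run above.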

HTH, Marc Schwartz



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu Jan 18 12:16:10 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 18 Jan 2007 - 02:30:29 GMT.
