Re: [R] help with regexpr in gsub

From: Marc Schwartz <marc_schwartz_at_comcast.net>
Date: Thu 18 Jan 2007 - 13:04:15 GMT

On Thu, 2007-01-18 at 04:49 +0000, Prof Brian Ripley wrote:
> One thing to watch with experiments like this is that the locale will
> matter. Character operations will be faster in a single-byte locale (as
> used here) than in a variable-byte locale (and I suspect Seth and Marc
> used UTF-8), and the relative speeds may alter. Also, the PCRE regexps
> are often much faster, and 'useBytes' can be much faster with ASCII data
> in UTF-8.
>
> For example:
>
> # R-devel, x86_64 Linux
> library(GO)
> goids <- ls(GOTERM)
> gids <- paste(goids, "ISS", sep=".")
> go.ids <- rep(gids, 10)
> > length(go.ids)
> [1] 205950
>
> # In en_GB (single byte)
>
> > system.time(z <- gsub("[.].*", "", go.ids))
> user system elapsed
> 1.709 0.004 1.716
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
> user system elapsed
> 0.241 0.004 0.246
>
> > system.time(z <- gsub('\\..+$','', go.ids))
> user system elapsed
> 2.254 0.018 2.286
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
> user system elapsed
> 2.890 0.002 2.895
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
> user system elapsed
> 2.716 0.002 2.721
> > system.time(z <- sub("\\..+", "", go.ids))
> user system elapsed
> 1.724 0.001 1.725
> > system.time(z <- substr(go.ids, 0, 10))
> user system elapsed
> 0.084 0.000 0.084
>
> # in en_GB.utf8
>
> > system.time(z <- gsub("[.].*", "", go.ids))
> user system elapsed
> 1.689 0.020 1.712
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE))
> user system elapsed
> 0.718 0.017 0.736
> > system.time(z <- gsub("[.].*", "", go.ids, perl=TRUE, useBytes=TRUE))
> user system elapsed
> 0.243 0.001 0.244
>
> > system.time(z <- gsub('\\..+$','', go.ids))
> user system elapsed
> 2.509 0.024 2.537
> > system.time(z <- gsub('([^.]+)\\..*','\\1',go.ids))
> user system elapsed
> 3.772 0.004 3.779
> > system.time(z <- sub("([GO:0-9]+)\\..*$", "\\1", go.ids))
> user system elapsed
> 4.088 0.007 4.099
> > system.time(z <- sub("\\..+", "", go.ids))
> user system elapsed
> 1.920 0.004 1.927
> > system.time(z <- substr(go.ids, 0, 10))
> user system elapsed
> 0.096 0.002 0.098
>
> substr still wins, but by a much smaller margin.
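
For anyone who wants to repeat the comparison without installing the GO package, here is a minimal sketch. The identifiers are synthetic stand-ins that I am assuming have the same shape as the real ones (a 10-character "GO:" id followed by ".ISS"); they are not taken from GOTERM. It first checks that the variants agree and then times them. The numbers will depend on the machine, the R version and the locale.

```r
# Synthetic identifiers (an assumption: same shape as the GO ids quoted above,
# i.e. a 10-character "GO:nnnnnnn" id followed by ".ISS")
ids <- sprintf("GO:%07d.ISS", 1:20000)
go.ids <- rep(ids, 10)

# Four of the variants from the quoted timings
z1 <- gsub("[.].*", "", go.ids)
z2 <- gsub("[.].*", "", go.ids, perl = TRUE)
z3 <- gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)
z4 <- substr(go.ids, 1, 10)

# All variants should strip the suffix in the same way
stopifnot(identical(z1, z2), identical(z1, z3), identical(z1, z4))

# Report the locale next to the timings, since it affects the results
print(Sys.getlocale("LC_CTYPE"))
print(system.time(gsub("[.].*", "", go.ids)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE)))
print(system.time(gsub("[.].*", "", go.ids, perl = TRUE, useBytes = TRUE)))
print(system.time(substr(go.ids, 1, 10)))
```

Note that substr() only works here because every id has the same width; the regexp versions do not rely on that.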

<snip>

Just to confirm Prof. Ripley's suspicion: I am indeed running in en_US.UTF-8.
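
In case it helps others check their own setup, the following is a small sketch of one way to see which kind of locale a session is using. It relies only on Sys.getlocale() and l10n_info() from base R; the values printed will of course differ from system to system.

```r
# Character-type locale of the current session, e.g. "en_US.UTF-8"
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() reports whether the locale is multibyte and whether it is UTF-8
info <- l10n_info()
print(info$MBCS)        # TRUE in a multibyte locale such as UTF-8
print(info[["UTF-8"]])  # TRUE if the locale is UTF-8
```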

Thanks for taking the time to point this out.

Best regards,

Marc



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Fri Jan 19 00:30:50 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 18 Jan 2007 - 14:00:23 GMT.
