Re: [R] Regular Expression help

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Sat, 08 Nov 2008 17:58:05 -0500

I'll see if I can speed it up if I get some time. I personally use it on relatively short strings, where the low absolute time means that the higher relative time your comparisons show is not that important.
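
A minimal sketch of the absolute-versus-relative point, assuming short strings of about 20 characters. The helpers make_strings, split_grep and split_sub are hypothetical names introduced only for this illustration; they mirror the quoted script below but are not part of it. The strapply timing runs only if gsubfn is installed and reuses the call from the quoted gabor4 test.

```r
# Hypothetical helper: n random strings of m letters (none of them p or q),
# each followed by a single p or q, mirroring generate() in the quoted script.
make_strings <- function(n, m) {
  body <- letters[c(1:15, 18:26)]
  replicate(n, paste0(paste(sample(body, m, replace = TRUE), collapse = ""),
                      sample(c("p", "q"), 1)))
}

# Base-R split: one regex pass, then logical indexing.
split_grep <- function(x) {
  is_p <- grepl("^[^pq]*p", x)
  list(p = x[is_p], q = x[!is_p])
}

# Base-R grouping on the first p or q found in each string.
split_sub <- function(x) {
  split(x, sub("^[^pq]*(.).*", "\\1", x))
}

set.seed(1)
short <- make_strings(10, 20)

# Report absolute elapsed time for 200 repetitions of each approach.
t_grep <- system.time(for (i in 1:200) split_grep(short))[["elapsed"]]
t_sub  <- system.time(for (i in 1:200) split_sub(short))[["elapsed"]]
cat("grep:", t_grep, "s   sub:", t_sub, "s\n")

# Optional: time the strapply variant from the quoted script, if available.
if (requireNamespace("gsubfn", quietly = TRUE)) {
  t_strapply <- system.time(
    for (i in 1:200)
      tapply(short, gsubfn::strapply(short, "^[^pq]*(.)", simplify = c), c)
  )[["elapsed"]]
  cat("strapply:", t_strapply, "s\n")
}
```

Printing the elapsed seconds alongside any ratio makes it easier to judge whether a large relative difference matters for a given workload.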

On Sat, Nov 8, 2008 at 5:33 PM, Wacek Kusnierczyk <Waclaw.Marcin.Kusnierczyk_at_idi.ntnu.no> wrote:
> Gabor Grothendieck wrote:
>> I suspect strapply is only relatively slow on short strings where
>> it doesn't matter anyways since for long strings performance would
>> likely be dominated by the underlying regexp operations. I know that
>> users are using the package for very long strings since I once had
>> to lift the 25,000 character limit since I had complaints about that.
>> The expressiveness and brevity of strapply (it would be shortest if it
>> were not for the length of the word simplify) offset any disadvantage
>> in my view.
>>
> ok, the attached tests against strings of length 30000 where the
> character that matches is precisely the last one. (gabor3 is dummy,
> because i had no patience to wait over a minute...) note that the
> strapply version is still approximately an order of magnitude slower.
>
> with the original script and string length (m) set to 10000, the
> strapply version is two orders of magnitude slower.
>
> it might be that the test is poor, though -- design a smart test where
> strapply wins ;)
> (related to the original problem, of course.)
>
> vQ
>
> generate = function(n, m)
>   replicate(n,
>     paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""),
>           sample(letters[16:17], 1), sep=""))
>
> tests = list(
>
>   wacek =
>     function(data) {
>       p = grep("^[^pq]*p", data)
>       list(p=data[p], q=data[-p])
>     },
>
>   gabor1 =
>     function(data)
>       sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
>
>   gabor2 =
>     function(data)
>       tapply(data, sub("^[^pq]*p(.).*", "\\1", data), c),
>
>   gabor3 =
>     function(data) 0,
>     # tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
>
>   gabor4 =
>     { library(gsubfn); function(data)
>       tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) }
> )
>
> data = generate(10,30000)
> for (name in names(tests)) {
>   cat(name, ":\n", sep="")
>   print(system.time(replicate(30,tests[[name]](data)))) }
>
>
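
One edge case in the grep-based split quoted above may be worth checking before reusing it: when no string matches the p pattern, grep returns an empty index, and negative indexing with an empty index selects nothing rather than everything. A small sketch, with made-up input strings and a logical-index variant offered only as one possible alternative:

```r
# Made-up input in which every string has q as its first p-or-q character.
x <- c("abcq", "xyzq")

p <- grep("^[^pq]*p", x)   # integer(0): no string matches
q_by_negation <- x[-p]     # character(0), not c("abcq", "xyzq")

# Alternative (an assumption, not from the quoted script): logical indexing
# with grepl() handles the no-match case as intended.
is_p <- grepl("^[^pq]*p", x)
safe <- list(p = x[is_p], q = x[!is_p])

print(q_by_negation)
print(safe)
```

With the benchmark data in the quoted script this case is unlikely to arise, since each of the generated strings ends in a randomly chosen p or q.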



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 08 Nov 2008 - 22:59:50 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 08 Nov 2008 - 23:30:23 GMT.
