Re: [R] Regular Expression help

From: Wacek Kusnierczyk
Date: Sat, 08 Nov 2008 23:33:34 +0100

Gabor Grothendieck wrote:
> I suspect strapply is only relatively slow on short strings where
> it doesn't matter anyways since for long strings performance would
> likely be dominated by the underlying regexp operations. I know that
> users are using the package for very long strings since I once had
> to lift the 25,000 character limit since I had complaints about that.
> The expressiveness and brevity of strapply (it would be shortest if it
> were not for the length of the word simplify) offset any disadvantage
> in my view.

ok, the attached script tests against strings of length 30000 where the matching character is precisely the last one. (gabor3 is a dummy, because i had no patience to wait over a minute...) note that the strapply version is still approximately an order of magnitude slower.

with the original script and the string length (m) set to 10000, the strapply version is two orders of magnitude slower.

it might be that the test is poor, though -- design a smart test where strapply wins ;)
(related to the original problem, of course.)
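
to see how the timings move with string length, something like the following sweep could be used. this is only a minimal sketch in base r: `time_test` and the chosen lengths are hypothetical and not part of the attached script, and only the grep-based variant is timed, so gsubfn is not needed here.

```r
# build n strings of m letters without p/q, each followed by one final p or q
generate = function(n, m)
	replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""), sample(letters[16:17], 1), sep=""))

# the grep-based split, as in the script
wacek = function(data) {
	p = grep("^[^pq]*p", data)
	list(p=data[p], q=data[-p]) }

# hypothetical helper: elapsed seconds for reps runs of test on fresh data
time_test = function(test, n, m, reps) {
	data = generate(n, m)
	system.time(replicate(reps, test(data)))[["elapsed"]] }

lengths = c(100, 1000, 10000)
elapsed = sapply(lengths, function(m) time_test(wacek, n=10, m=m, reps=5))
print(data.frame(m=lengths, elapsed=elapsed))
```

swapping another function in place of `wacek` gives the corresponding curve for that variant; the absolute numbers will of course depend on the machine.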


# n strings of m letters without p/q, each followed by one final p or q
generate = function(n, m)
	replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""), sample(letters[16:17], 1), sep=""))

tests = list(
	wacek =
	function(data) {
		p = grep("^[^pq]*p", data)
		list(p=data[p], q=data[-p]) },
	gabor1 =
	function(data)
		sapply(c(p="^[^pq]*p", q="^[^pq]*q"), grep, x=data, value=TRUE),
	gabor2 =
	function(data)
		# group by the first p or q found in each string
		tapply(data, sub("^[^pq]*(.).*", "\\1", data), c),
	gabor3 =
	# dummy: the real version (commented out below) takes over a minute
	function(data) 0,
		# tapply(data, substr(gsub("[^pq]", "", data), 1, 1), c),
	gabor4 =
	{ library(gsubfn); function(data)
		tapply(data, strapply(data, "^[^pq]*(.)", simplify=c), c) })

data = generate(10,30000)
for (name in names(tests)) {
	cat(name, ":\n", sep="")
	print(system.time(replicate(30,tests[[name]](data)))) }
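
before trusting the timings it may be worth checking that the variants actually produce the same split. the following is a small sanity check on short strings, again only a sketch in base r: it compares an index-based split with a key-based one, using `grepl` and `split` rather than the exact calls from the script, and leaves strapply out so that gsubfn is not required.

```r
set.seed(1)

# same generator as in the script, used here on short strings
generate = function(n, m)
	replicate(n, paste(paste(sample(letters[c(1:15, 18:26)], m, replace=TRUE), collapse=""), sample(letters[16:17], 1), sep=""))

data = generate(20, 50)

# index-based split; a logical index stays correct even if no string matches
is_p = grepl("^[^pq]*p", data)
by_index = list(p=data[is_p], q=data[!is_p])

# key-based split: the first p or q in each string is the grouping key
key = sub("^[^pq]*(.).*", "\\1", data)
by_key = split(data, factor(key, levels=c("p", "q")))

# both approaches should partition the data identically
stopifnot(all(key %in% c("p", "q")))
stopifnot(identical(by_index$p, by_key$p))
stopifnot(identical(by_index$q, by_key$q))
```

the logical index is a deliberate choice in this sketch: with `p = grep(...)` and `data[-p]`, an empty `p` makes `data[-p]` empty as well, since `-integer(0)` is still `integer(0)`.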

Received on Sat 08 Nov 2008 - 22:35:59 GMT

