Re: [Rd] "bug" and patch: quadratic running time for strsplit(..., fixed=TRUE) (PR#9902)

From: <maechler_at_stat.math.ethz.ch>
Date: Sat, 08 Sep 2007 11:50:54 +0200 (CEST)


>>>>> "JB" == John Brzustowski <jbrzusto_at_fastmail.fm>

>>>>>     on Fri,  7 Sep 2007 15:47:26 +0200 (CEST) writes:

>>>>> "JB" == John Brzustowski <jbrzusto_at_fastmail.fm>
>>>>> on Fri, 7 Sep 2007 15:47:26 +0200 (CEST) writes: JB> Full_Name: John Brzustowski JB> Version: R-devel-trunk, R-2.4.0

    JB> OS: linux, gcc 4.0.3
    JB> Submission from: (NULL) (206.248.157.184)

    JB> This isn't a bug, but an easily-remedied performance issue.

    JB> SYMPTOM     >> for (i in 1000 * (1:20)) {

    JB> y <- paste(rep("asdf", times=i), collapse=" ")
    JB> t <- system.time(strsplit(y, " ", fixed=TRUE))
    JB> cat(sprintf("i=%5d time=%5d msec\n",i, round(1000*t[1])))
    JB> }

    JB> i= 1000 time=    2 msec
    JB> i= 2000 time=    9 msec

    JB> i= 3000 time= 20 msec
    JB> i= 4000 time= 34 msec

etc

    JB> i=17000 time=  726 msec
    JB> i=18000 time=  864 msec
    JB> i=19000 time=  944 msec
    JB> i=20000 time= 1106 msec

    JB> DIAGNOSIS

    JB> strsplit() uses strlen() in the bounds check clause of a for(;;)
    JB> statement, which forces a full scan of the source string for each
    JB> character in the source string.  Unlike R's LENGTH() macro, strlen for
    JB> C strings is an expensive operation, and in this case (at least),
    JB> gcc 4.0.3's -O2 level optimizer is not able to recognize the call as a loop
    JB> invariant, despite the declaration "const char *buf".

    JB> REMEDIED BEHAVIOUR

    JB> i= 1000 time=    0 msec
    JB> i= 2000 time=    1 msec
    JB> i= 3000 time=    1 msec
    JB> i= 4000 time=    0 msec
    JB> i= 5000 time=    1 msec
    JB> i= 6000 time=    1 msec
    JB> i= 7000 time=    1 msec

...

I can confirm this behavior before and after your patch, even for the quite recent gcc 4.2.1 and -O3 (which I have changed the default to, at least on some platforms).

Thank you for this excellent spotting. I haven been more naive about today's compiler optimizations and assumed they "surely" would optimize such a case.

Thanks a lot for your nice demonstration and patches! I'm about to commit them to the development trunk, but most probably they will even be back-ported to R 2.6.0 alpha.

Best regards,
Martin

    JB> RELATED ISSUES

    JB> A simple search turns up other instances of this usage in R's source.
    JB> For completeness, I'm submitting patches for all of them, but have not
    JB> tested whether they in fact cause a detectable performance problem.
    JB> In the case of modules/X11/dataentry.c, the patch also fixes a presumably
    JB> ineffectual "bug".

    JB> $ grep -nR "for *([^;]*;[^;]*strlen *(" *

    JB> main/rlocale.c:137:		for (i = 0; i < strlen(lc_str) && i < sizeof(lc_str); i++)
    JB> main/printutils.c:486:		for(j = 0; j < strlen(buf); j++) *q++ = buf[j];
    JB> main/sysutils.c:493:		for(j = 0; j < strlen(sub); j++) *outbuf++ = sub[j];
    JB> modules/X11/rotated.c:608:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/rotated.c:856:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/rotated.c:1399:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/rotated.c:1797:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/rotated.c:2045:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/rotated.c:2339:	for(i=0; i<strlen(text)-1; i++)
    JB> modules/X11/dataentry.c:1358:   for(i = 0; i < strlen(text); i++) *bufp++ =
    JB> text[i];


    JB> PATCHES (against trunk)
    JB> Only the first is required to fix strsplit().

    JB> Index: src/main/character.c

[.....]



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat 08 Sep 2007 - 09:53:40 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 08 Sep 2007 - 22:40:24 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.