From: Petr Savicky <savicky_at_cs.cas.cz>

Date: Thu, 17 May 2007 11:03:00 +0200

grep(y,x)

which requires 110 comparisons (10 comparisons for each of 11 possible beginnings of y in x). In general, the complexity in the worst case is O(m*n), where m,n are the lengths of y,x resp. I would expect that the library function for matching regular expressions needs time O(m+n) and so will be faster. However, the result obtained on a larger example is

R-devel_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 17 May 2007 - 09:07:20 GMT

Date: Thu, 17 May 2007 11:03:00 +0200

> strncasecmp is not standard C (not even C99), but R does have a substitute

*> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
**> Linux systems have wcsncasecmp but that is not portable. In these days of
**> widespread use of UTF-8 that is a blocking issue, I am afraid.
*

What could help are the functions mbrtowc and towctrans and simple long integer comparison. Are the functions mbrtowc and towctrans available under Windows? mbrtowc seems to be available as Rmbrtowc in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower is not usable with utf-8 under Windows?

> In the case of grep I think all you need is

*>
**> grep(tolower(pattern), tolower(x), fixed = TRUE)
**>
**> and similarly for regexpr.
*

Yes. this is correct, but it has disadvantages. It needs more space and, if value=TRUE, we would have to do something like

x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)] This is hard to implement in src/library/base/R/grep.R, where the call to .Internal(grep(pattern,...)) is the last command and I think this should be preserved.

> >Ignore case option is not meaningfull in gsub.

*>
**> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
**>
**> is different from 'ignore.case=FALSE', and I see the meaning as clear.
**> So what did you mean? (Unfortunately the tolower trick does not work for
**> [g]sub.)
*

The meaning of ignore.case in [g]sub is problematic due to the following.
sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces

[1] "xyzD" "xyzd"

but the user may in fact need the following
[1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user should be aware of the fact that several substitutions like x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE sub("ABC", "XYZ", x) # ignore.case=FALSE may be more useful.

x <- "aaaaaaaaaaaaaaaaaaab"

grep(y,x)

which requires 110 comparisons (10 comparisons for each of 11 possible beginnings of y in x). In general, the complexity in the worst case is O(m*n), where m,n are the lengths of y,x resp. I would expect that the library function for matching regular expressions needs time O(m+n) and so will be faster. However, the result obtained on a larger example is

> x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "") > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "") > y <- paste(c(rep("a", times = 10000), x2), collapse = "") > z <- rep(y, times = 100)

> system.time(i <- grep(x1, z, fixed = T)) [1] 1.970 0.000 1.985 0.000 0.000

> system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*) [1] 40.374 0.003 40.381 0.000 0.000

> system.time(i <- grep(x2, z, fixed = T)) [1] 0.113 0.000 0.113 0.000 0.000

> system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one [1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?

Petr.

R-devel_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 17 May 2007 - 09:07:20 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Thu 17 May 2007 - 16:33:05 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel.
Please read the posting
guide before posting to the list.
*