Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 17 May 2007 10:28:55 +0100 (BST)

On Thu, 17 May 2007, Petr Savicky wrote:

>> strncasecmp is not standard C (not even C99), but R does have a substitute
>> for it. Unfortunately strncasecmp is not usable with multibyte charsets:
>> Linux systems have wcsncasecmp but that is not portable. In these days of
>> widespread use of UTF-8 that is a blocking issue, I am afraid.
>
> What could help are the functions mbrtowc and towctrans and simple
> long integer comparison. Are the functions mbrtowc and towctrans
> available under Windows? mbrtowc seems to be available as Rmbrtowc
> in src/gnuwin32/extra.c.
>
> I did not find towctrans defined in R sources, but it is in
> gnuwin32/Rdll.hide

I don't see it in Rdll.hide. It is a C99 function (see your unix man page).

> and used in do_tolower. Does this mean that tolower is not usable
> with utf-8 under Windows?

UTF-8 is not usable under Windows, but tolower works in Windows DBCS (in so far as that makes sense: Chinese chars do not have 'case').

Rmbrtowc reflects an attempt to add UTF-8 support on Windows, but that is not currently active.

>> In the case of grep I think all you need is
>>
>> grep(tolower(pattern), tolower(x), fixed = TRUE)
>>
>> and similarly for regexpr.
>
> Yes. this is correct, but it has disadvantages. It needs more
> space and, if value=TRUE, we would have to do something like
> x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
> This is hard to implement in src/library/base/R/grep.R,
> where the call to .Internal(grep(pattern,...)) is the last command
> and I think this should be preserved.
>
>>> Ignore case option is not meaningfull in gsub.
>>
>> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
>>
>> is different from 'ignore.case=FALSE', and I see the meaning as clear.
>> So what did you mean? (Unfortunately the tolower trick does not work for
>> [g]sub.)
>
> The meaning of ignore.case in [g]sub is problematic due to the following.
> sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
> produces
> [1] "xyzD" "xyzd"
> but the user may in fact need the following
> [1] "XYZD" "xyzd"

He may, but that is not what 'ignore case' means, more like 'case honouring'.

> It is correct that "xyzD" "xyzd" is produced, but the user
> should be aware of the fact that several substitutions like
> x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE
> sub("ABC", "XYZ", x) # ignore.case=FALSE
> may be more useful.
>
> I have another question concerning the speed of grep. I expected that
> fgrep_one function is slower than calling a library routine
> for regular expressions. In particular, if the pattern has a lot of
> long partial matches in the target string, I expected that it may be much
> slower. A short example is
> y <- "aaaaaaaaab"
> x <- "aaaaaaaaaaaaaaaaaaab"
> grep(y,x)
> which requires 110 comparisons (10 comparisons for each of 11 possible
> beginnings of y in x). In general, the complexity in the worst case is
> O(m*n), where m,n are the lengths of y,x resp. I would expect that
> the library function for matching regular expressions needs
> time O(m+n) and so will be faster. However, the result obtained
> on a larger example is
>
> > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
> > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
> > y <- paste(c(rep("a", times = 10000), x2), collapse = "")
> > z <- rep(y, times = 100)
>
> > system.time(i <- grep(x1, z, fixed = T))
> [1] 1.970 0.000 1.985 0.000 0.000
>
> > system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*)
> [1] 40.374 0.003 40.381 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = T))
> [1] 0.113 0.000 0.113 0.000 0.000
>
> > system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one
> [1] 0.019 0.000 0.019 0.000 0.000
>
> Do you have an explanation of these results, in particular (*)?

Yes, there is a comment on the help page to that effect. But these are highly atypical uses. Try perl=TRUE, and be aware that the locale matters a lot in such tests (via the charset).

No one is attempting to make R a fast string-processing language and so developers resources are spent on performance where it matters to more typical usage. (E.g. reducing duplication in as.double and friends speeds up just about every R session, and speeds up some numerical sessions dramatically.)

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 17 May 2007 - 09:33:11 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 17 May 2007 - 19:33:08 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.