Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon, 14 May 2007 09:42:44 +0100 (BST)

On Fri, 11 May 2007, Petr Savicky wrote:

> On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
>> I suggest you collaborate with the person who replied that he thought this
>> was a good idea to supply patches against the R-devel sources for
>> scrutiny.
>
> A possible solution is to use strncasecmp instead of strncmp
> in function fgrep_one in R-devel/src/main/character.c.
>
> Corresponding modification of character.c is at
> http://www.cs.cas.cz/~savicky/ignore_case/character.c
> and diff file w.r.t. the original character.c (downloaded today) is at
> http://www.cs.cas.cz/~savicky/ignore_case/diff.txt
>
> This seems to work in my installation of R-devel:
>
> > x <- c("D.G cat", "d.g cat", "dog cat")
> > z <- "d.g"
> > grep(z, x, ignore.case = F, fixed = T)
> [1] 2
> > grep(z, x, ignore.case = T, fixed = T) # this is the new behavior
> [1] 1 2
> > grep(z, x, ignore.case = T, fixed = F)
> [1] 1 2 3
> >
>
> Since fgrep_one is used many times in character.c, adding igcase_opt as
> an additional argument would imply extensive changes to the file.
> So, I introduced a new function fgrep_one_igcase called only once in
> the file. Another solution is possible.
>
> I do not understand the handling of multibyte chars well, so I did
> not test the function with real multibyte chars, although the code
> for this option is used.

Thanks for looking into this.

strncasecmp is not standard C (not even C99), but R does have a substitute for it. Unfortunately strncasecmp is not usable with multibyte charsets: Linux systems have wcsncasecmp but that is not portable. In these days of widespread use of UTF-8 that is a blocking issue, I am afraid.

In the case of grep I think all you need is

grep(tolower(pattern), tolower(x), fixed = TRUE)

and similarly for regexpr.
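
A minimal sketch of that trick, reusing the x and z from the quoted example above. The names hits, pos and matched are only illustrative, and reading the positions back against the original x assumes that tolower() does not change the number of characters in these strings:

```r
x <- c("D.G cat", "d.g cat", "dog cat")
z <- "d.g"

# Lower-case both sides, then do a literal (fixed) match.
hits <- grep(tolower(z), tolower(x), fixed = TRUE)

# Same idea for regexpr: positions are found in the lower-cased copy.
pos <- regexpr(tolower(z), tolower(x), fixed = TRUE)

# Assumption: tolower() keeps the character count here, so the positions
# can be applied to the original strings to recover the original case.
matched <- regmatches(x, pos)

print(hits)
print(as.integer(pos))
print(matched)
```

Here "dog cat" is not matched because the "." in the pattern is taken literally under fixed = TRUE.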

> Ignore case option is not meaningful in gsub.

sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)

is different from 'ignore.case=FALSE', and I see the meaning as clear. So what did you mean? (Unfortunately the tolower trick does not work for [g]sub.)
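
To make both points concrete, here is a small sketch using the same inputs as the call above. The variable names are illustrative only. The last call shows why the tolower trick is not a substitute for [g]sub: the replacement is made in the lower-cased copy, so the case of the unmatched part of the string is lost.

```r
x <- c("ABCD", "abcd")

# Regular-expression sub, with and without case folding.
with_ic    <- sub("abc", "123", x, ignore.case = TRUE)
without_ic <- sub("abc", "123", x, ignore.case = FALSE)

# The tolower trick applied to sub: the result is built from tolower(x),
# so the trailing "D" of the first element comes back as "d".
lowered <- sub("abc", "123", tolower(x), fixed = TRUE)

print(with_ic)
print(without_ic)
print(lowered)
```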

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 14 May 2007 - 08:47:24 GMT
