Re: [R] Regex engine types

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Sat 10 Jun 2006 - 22:55:28 EST

I get the same result in a US collate ordering:

> strsplit(Sys.getlocale(), ";")

[[1]]
[1] "LC_COLLATE=English_United States.1252"
[2] "LC_CTYPE=English_United States.1252"
[3] "LC_MONETARY=English_United States.1252"
[4] "LC_NUMERIC=C"
[5] "LC_TIME=English_United States.1252"

> grep("[W-Z]", letters, value = TRUE)
[1] "x" "y" "z"
> R.version.string # Windows XP
[1] "Version 2.3.1 Patched (2006-06-04 r38279)"
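
For anyone reading this in the archive, the result can be reproduced with a toy model of the explanation quoted below, where a range is read as "every character that sorts between the two endpoints". This is only a sketch: the helper in_range() and the two ordering vectors are made up for illustration, and it does not claim to mirror how the regex engine is actually implemented.

```r
# Toy model: treat [lo-hi] as "all characters whose position in `ordering`
# lies between the positions of lo and hi". in_range() is a hypothetical
# helper written for this note, not part of R.
in_range <- function(chars, lo, hi, ordering) {
  pos <- match(chars, ordering)
  keep <- !is.na(pos) & pos >= match(lo, ordering) & pos <= match(hi, ordering)
  chars[keep]
}

# Interleaved ordering aAbB...zZ, as described for en_NZ.utf8 in the reply
interleaved <- as.vector(rbind(letters, LETTERS))
# Code-point ordering (what LC_COLLATE=C gives for ASCII): A..Z, then a..z
codepoint <- c(LETTERS, letters)

lower_interleaved <- in_range(letters, "W", "Z", interleaved)  # "x" "y" "z"
upper_interleaved <- in_range(LETTERS, "W", "Z", interleaved)  # "W" "X" "Y" "Z"
lower_codepoint   <- in_range(letters, "W", "Z", codepoint)    # character(0)
```

Under the interleaved ordering "w" sorts just before "W", so it falls outside the range while "x", "y" and "z" fall inside it, which matches the output above.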

On 6/10/06, Prof Brian Ripley <ripley@stats.ox.ac.uk> wrote:
> ?regex does describe this:
>
> A range of characters may be specified by giving the first and last
> characters, separated by a hyphen. (Character ranges are
> interpreted in the collation order of the current locale.)
>
> You did not tell us your locale, but based on questions from you in the
> past I would guess en_NZ.utf8. In that locale the collation order is
> wWxXyYzZ, so your surprise is explained. (It seems the PCRE code is not
> using the same ordering in that locale.)
>
> You may find it useful to set LC_COLLATE to C as I do:
>
> > strsplit(Sys.getlocale(), ";")
> [[1]]
>  [1] "LC_CTYPE=en_GB"       "LC_NUMERIC=C"         "LC_TIME=en_GB"
>  [4] "LC_COLLATE=C"         "LC_MONETARY=en_GB"    "LC_MESSAGES=en_GB"
>  [7] "LC_PAPER=en_GB"       "LC_NAME=C"            "LC_ADDRESS=C"
> [10] "LC_TELEPHONE=C"       "LC_MEASUREMENT=en_GB" "LC_IDENTIFICATION=C"
>
>
> On Sat, 10 Jun 2006, Patrick Connolly wrote:
>
> >> version
> >          _
> > platform x86_64-unknown-linux-gnu
> > arch     x86_64
> > os       linux-gnu
> > system   x86_64, linux-gnu
> > status
> > major    2
> > minor    2.1
> > year     2005
> > month    12
> > day      20
> > svn rev  36812
> > language R
> >>
> >
> >> grep("[W-Z]", LETTERS, value = TRUE)
> > [1] "W" "X" "Y" "Z"
> >
> > That's what I'd have expected.
> >
> >> grep("[W-Z]", letters, value = TRUE)
> > [1] "x" "y" "z"
> >
> > Not what I'd have thought. However,
> >
> >> grep("[B-D]", letters, value = TRUE, perl = TRUE)
> > character(0)
> >
> > So what is it that standard regular expressions use that's different
> > from Perl-type ones?
> >
> > The help file for grep refers to POSIX 1003.2 which looked a bit
> > daunting to delve into. From my limited reading, it seems there are
> > different regex "Engine Types" which seems to be getting somewhat
> > tangential to what I was working on. I could probably avoid problems
> > if I always set perl=TRUE, but it would be good to know what basic and
> > extended regular expressions do that's different. If someone has a
> > quick line or two describing it, I'd be interested to know.
>
> --
> Brian D. Ripley, ripley@stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
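
As a short illustration of the advice quoted above, here are two ways to keep a character class from depending on the collation order. This is a sketch rather than a recommendation from the thread: the variable names are made up, and only the LC_COLLATE suggestion comes from the quoted reply. Listing the characters explicitly is an extra option that avoids ranges altogether.

```r
# Option 1: list the characters explicitly. With no range in the class,
# the collation order of the locale cannot affect what is matched.
lower_hits <- grep("[WXYZ]", letters, value = TRUE)  # character(0)
upper_hits <- grep("[WXYZ]", LETTERS, value = TRUE)  # "W" "X" "Y" "Z"

# Option 2: set LC_COLLATE to C, as suggested in the quoted reply,
# and restore the previous setting afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
range_hits <- grep("[W-Z]", letters, value = TRUE)
Sys.setlocale("LC_COLLATE", old_collate)
```

With LC_COLLATE set to C the range should follow code-point order, so range_hits is expected to be empty, but that depends on the R version and platform in use.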



Received on Sat Jun 10 23:02:48 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sun 11 Jun 2006 - 05:34:59 EST.
