Re: [Rd] bug in rank(), order(), is.unsorted() on character vector

From: Hervé Pagès <hpages_at_fhcrc.org>
Date: Thu, 08 Dec 2011 01:57:02 -0800

Hi Paul,

On 11-12-07 10:29 AM, Roebuck,Paul L wrote:
> Do this first and try again.
>
> R> Sys.setlocale("LC_COLLATE", "C")

OK I see it now (in ?Sys.setlocale):

   Sys.setlocale("LC_COLLATE", "C") # turn off locale-specific sorting,
                                     #  usually

Thanks all for the answers!

I never really realized how counter-intuitive some collating sequences can be. For example, the fact that LC_COLLATE=en_CA.UTF-8 doesn't preserve the order of the strings when a common suffix is added to them is scary. Also, it's not that LC_COLLATE=en_CA.UTF-8 just ignores the '_' (underscores) and the '.' (dots): that can only be a first pass, after which it needs to break ties in a way that defines a total order. So it looks like the exact definition of this collating sequence is counter-intuitive and complicated.
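For comparison, here is a minimal sketch of the same example under the C collating sequence, where strings are compared byte by byte ('1' and '2' sort before '_' in ASCII). The variable names rank_x, rank_xa and old are just for illustration. For these three strings, adding the common suffix leaves the ranks unchanged:

```r
x  <- c("_1_", "1_9", "2_9")
xa <- paste(x, "a", sep = "")

# Remember the current collating sequence so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")

rank_x  <- rank(x)   # byte-wise comparison: "1_9" < "2_9" < "_1_"
rank_xa <- rank(xa)  # same relative order once the suffix is appended

# Put the original collating sequence back.
Sys.setlocale("LC_COLLATE", old)

print(rank_x)
print(rank_xa)
```

Note that this only shows the behaviour for this particular input; it says nothing about how en_CA.UTF-8 defines its own order.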

Maybe that's just how things are and the developers that want portability and reproducibility of their code are already putting a Sys.setlocale("LC_COLLATE", "C") statement somewhere in their package to force all their users to be on the same collating sequence. It sounds a little bit drastic though and it might introduce some conflicts with other packages.

So maybe a better approach is to only alter LC_COLLATE temporarily inside the functions where it matters i.e. where the returned value actually depends on the collating sequence? If I don't do this, then there is no way I can write a test for my function because the test would work for me but fail for someone else.
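Something along these lines is what I have in mind. This is only a sketch: sort_c is a made-up name, not an existing function, and it assumes that switching LC_COLLATE to "C" is supported on the platform (it should be, but see ?Sys.setlocale). The on.exit() call restores the caller's collating sequence even if sort() fails:

```r
# Hypothetical helper: sort a character vector under the C collating
# sequence, then restore whatever LC_COLLATE the caller had.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

before <- Sys.getlocale("LC_COLLATE")
sorted <- sort_c(c("_1_a", "1_9a", "2_9a"))
after  <- Sys.getlocale("LC_COLLATE")

print(sorted)
```

The result no longer depends on the user's locale, and the global setting is left untouched once the function returns, so a test on the returned value should behave the same for everybody.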

Actually this is the situation I was facing when I did my first post: I have a function that downloads a list of sequences from the Ensembl FTP server, sorts them by name, and returns them to the user. I have a test for that function and the test was working for me when I was doing

   tools::testInstalledPackage("MyPackage", types="tests")

but it was failing when I was doing 'R CMD check'. It seems that the latter alters LC_COLLATE before running the tests (maybe to LC_COLLATE=C) but not the former. I fixed this by enforcing LC_COLLATE=C inside my function.

A naive question: wouldn't everything be simpler if LC_COLLATE=C was the default for everybody?

Thanks,
H.

>
>
> On 12/7/11 3:41 AM, "Hervé Pagès"<hpages@fhcrc.org> wrote:
>
>> Hi,
>>
>> This looks OK:
>>
>>> x<- c("_1_", "1_9", "2_9")
>>> rank(x)
>> [1] 1 2 3
>>
>> But this does not:
>>
>>> xa<- paste(x, "a", sep="")
>>> xa
>> [1] "_1_a" "1_9a" "2_9a"
>>> rank(xa)
>> [1] 2 1 3
>>
>> Cheers,
>> H.
>>
>>> sessionInfo()
>> R version 2.14.0 (2011-10-31)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=C LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.14.0
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages_at_fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 08 Dec 2011 - 10:01:18 GMT
