Re: [R] a question of alphabetical order

From: Hans-Joerg Bibiko <bibiko_at_eva.mpg.de>
Date: Wed, 16 Apr 2008 10:49:54 +0200

Hi,

As already mentioned, sorting can be a pain.

My solution is to write my own "order" routine for a given language.
The idea is to transform the UTF-8 string into ASCII in such a way that the built-in order routine outputs the desired result. But this can be a rather rocky road.

Example for Spanish (please correct me if I'm wrong):

- accents are ignored
- ll is one single entity and comes after l (ludar comes before llave)
- ch is one single entity and comes after c

The only thing I do not know is whether a 'll' can ever be two separate letters rather than one entity (for example, where two nouns are joined into a compound). If so, the whole story becomes much more complicated.
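
To see the ll/ch rules in isolation, here is a minimal Python sketch. It is only an illustration under my own assumptions: the name spanish_key is made up, the words carry no accents, everything is lowercased first, and every 'll' and 'ch' is treated as one letter (so the compound-word problem just mentioned is not handled).

```python
def spanish_key(word):
    """Sort key for traditional Spanish order: ch after c, ll after l.

    Assumption: the word has no accents and every 'ch'/'ll' is one letter.
    """
    w = word.lower()
    # '{' is the ASCII character right after 'z', so 'c{' sorts after 'cz'
    return w.replace("ch", "c{").replace("ll", "l{")


words = ["llave", "ludar", "chico", "cuna", "lana", "cena"]
print(sorted(words, key=spanish_key))
# → ['cena', 'cuna', 'chico', 'lana', 'ludar', 'llave']
```

Python compares strings by code point, so the '{' trick works there directly; other tools may compare according to the locale instead.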

Now the big question is how to strip all these accents, so that the accented forms of a, a, y, n and u end up as plain aaynu. (Technically speaking, this is the compatibility decomposition of a Unicode string, normalization form NFKD, after which the combining accents can be deleted.)
One possible way is to use a scripting language which can handle it. The only language I know of that can do it out of the box is Python; for Ruby or Perl, as far as I know, one has to install an additional library.

On a Mac, Python is installed by default; on Windows it is not. If this ordering is also an issue for Windows users, they have to install Python beforehand.
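
Just to show the decomposition step on its own, here is a small sketch in plain Python. It assumes Python 3 and the standard unicodedata module; the function name strip_accents is only an illustration, and it removes the same range of combining diacritics (U+0300 to U+036F, i.e. 768 to 879).

```python
import unicodedata


def strip_accents(text):
    """Decompose text to NFKD and drop the combining marks U+0300..U+036F."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)


# \u00f1 is n with tilde, \u00e9 is e with acute accent
print(strip_accents("ma\u00f1ana caf\u00e9"))
# → manana cafe
```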

Here is the code:

orderES <- function(x) {

     # decompose all accented characters
     str <- NFKD(x)

     # all combining diacritics (U+0300 to U+036F)
     nonChars <- c(768:879)
     pattern <- paste("[", intToUtf8(as.integer(nonChars)), "]", sep="")

     # delete all combining diacritics
     str <- gsub(pattern, "", str)

     # transform ll and ch to l{ and c{ ({ comes after z in ASCII;
     # this assumes order() compares in byte order, as in the C locale)
     str <- gsub("ll", "l{", gsub("ch", "c{", str))
     order(str)

}

NFKD <- function(x) {

     # builds a small Python 2 script (print statement, unicode()) from x,
     # pipes it to python and returns one decomposed string per element of x
     system(paste("echo -en '# coding=utf-8\nimport unicodedata\nfor i,v in enumerate([\"" , paste(x, collapse="\", \""), "\"]):print unicodedata.normalize(\"NFKD\",unicode(v, \"UTF-8\")).encode(\"UTF-8\")'|python -", sep=""), intern=TRUE)

}

Notes on the NFKD routine:
- it only works if R's environment is set to UTF-8!
- for instance, a Danish ø won't be decomposed to o and / (such cases have to be solved manually)
- this routine is not very fast
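
A short Python 3 sketch of what "solved manually" could look like. The table MANUAL and the function to_ascii_base are hypothetical names of mine, and mapping ø to a plain o is just an assumption for illustration; which replacement gives the right order depends on the language.

```python
import unicodedata

# Hypothetical hand-made table for letters that NFKD leaves untouched
MANUAL = {"\u00f8": "o", "\u00d8": "O"}  # ø and Ø


def to_ascii_base(text):
    """NFKD-decompose, drop combining marks, then apply the manual table."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(MANUAL.get(ch, ch) for ch in no_marks)


# ø has no Unicode decomposition, so NFKD returns it unchanged
print(unicodedata.normalize("NFKD", "\u00f8") == "\u00f8")  # → True
print(to_ascii_base("sm\u00f8r"))  # → smor
```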

Cheers,

--Hans



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

Received on Wed 16 Apr 2008 - 08:55:13 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 16 Apr 2008 - 11:30:29 GMT.