Re: [R] using non-ASCII strings in R packages

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu 25 Jan 2007 - 09:17:48 GMT

On Thu, 25 Jan 2007, Bojanowski, M.J. (Michal) wrote:

> Hello dear useRs and wizaRds,
>
> I am currently developing a package that will make it possible to use
> an administrative map of Poland in R plots. Among other things I wanted
> to include region names in proper Polish so that they can be used
> in creating graphics etc. I am working on Windows, and when I build the
> package it complains about non-ASCII characters in the R code files.
>
> I was wondering what would be the best way to properly implement them in
> a platform-independent way so that they can be used in computations as
> well as in producing PS, PDF and other graphic output. Unfortunately I
> have a limited knowledge of encoding schemes etc. Is it OK to include
> them in Windows-1250 encoding (default for Polish locale, as far as I
> know)? I believe this problem is frequently confronted for other
> "non-latin1" languages.

Well, infrequently, and it has been answered a few times before (including in my talk at UseR 2006,
http://www.r-project.org/useR-2006/Slides/Ripley.pdf).

> If it is not the way to go, I would be very grateful for suggestions.

Since a Japanese-language Windows machine cannot reproduce Polish non-ASCII characters, the portability you seek is not possible for reasons outside R. And many other systems cannot plot in both Polish and their native language, or at least not in the same font.

ISOLatin2 is the standard 8-bit encoding for Polish. Windows CP1250 covers the same Polish letters but is not a byte-for-byte superset: several of them (ą, ś, ź and their capitals) sit at different code points. If all your users are using an ISOLatin2 locale, strings stored in ISOLatin2 would be safe, but not otherwise. Even then, there is no guarantee that the Polish characters would be in the fonts used in PostScript and PDF: some fonts only cover ISOLatin1.
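
How the two 8-bit encodings treat a given letter can be checked from R itself. The following is only a small sketch: it assumes the platform's iconv recognises the names "ISO-8859-2" and "CP1250" (most builds do, but that is not guaranteed), and bytes_in() is a made-up helper, not an R function.

```r
# bytes_in() is a hypothetical helper written for this illustration.
# It returns the raw byte(s) that encode `ch` in the given 8-bit encoding.
bytes_in <- function(ch, encoding) {
  # enc2utf8() makes sure the input really is UTF-8 before converting;
  # toRaw = TRUE asks iconv() for raw bytes instead of a character string.
  iconv(enc2utf8(ch), from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
}

# Written as \u escapes so that this source stays pure ASCII.
a_ogonek <- "\u0105"   # Polish a with ogonek
o_acute  <- "\u00f3"   # Polish o with acute

bytes_in(a_ogonek, "ISO-8859-2")   # b1
bytes_in(a_ogonek, "CP1250")       # b9
bytes_in(o_acute,  "ISO-8859-2")   # f3
bytes_in(o_acute,  "CP1250")       # f3
```

Letters that give the same byte in both encodings can be exchanged freely between the two; the others need an explicit conversion.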

There is one thing you can do to make this a little more portable (and avoid the warnings). If you store the strings concerned in a text file in ISOLatin2, and read them into R at run time (e.g. when your package is loaded), you can make use of file(encoding=) or iconv() to convert them to the current encoding. That will succeed in ISOLatin2 or CP1250 or UTF-8 locales and fail otherwise.
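
A minimal sketch of that approach, under some assumptions: the file name, the two region names and the helper read_latin2() are all invented for illustration, and the data file is created in a temporary location here only so that the example is self-contained (in a package it would be shipped with the sources). The sketch uses iconv(); opening the file with file(path, encoding = "ISO-8859-2") is the other route mentioned above.

```r
# Hypothetical stand-in for a data file shipped with the package.
path <- tempfile(fileext = ".txt")

# Two example region names, as \u escapes so this code stays ASCII.
regions_utf8 <- c("\u0141\u00f3dzkie", "\u015al\u0105skie")

# Write them to disk as ISOLatin2 bytes, one name per line.
raw_lines <- iconv(regions_utf8, from = "UTF-8", to = "ISO-8859-2",
                   toRaw = TRUE)
writeBin(unlist(lapply(raw_lines, function(r) c(r, as.raw(0x0a)))), path)

# read_latin2() is a hypothetical helper: read the ISOLatin2 file and
# convert it. to = "" means "the current locale's encoding"; entries
# that cannot be represented there come back as NA.
read_latin2 <- function(path, to = "") {
  x <- readLines(path, warn = FALSE)
  iconv(x, from = "ISO-8859-2", to = to)
}

# Converting to UTF-8 here so the result does not depend on the locale
# this example happens to be run in.
regions <- read_latin2(path, to = "UTF-8")
identical(regions, regions_utf8)   # TRUE
```

At load time a package would call something like read_latin2(path) with the default to = "" and check the result for NA before using the names.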

Unfortunately that is not the end of the story for users of UTF-8 locales, as postscript() and pdf() do not support UTF-8 (as the graphics languages do not) and need to be told to use encoding="ISOLatin2.enc". The X11 system also has a mind of its own and may not show non-ASCII characters in some fonts (or worse, render them incorrectly).
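
For the PDF case, passing the encoding might look roughly like this. It is a sketch only: the output file and the label are made up, the default font family is assumed, and whether the glyphs actually appear still depends on the fonts, as described above.

```r
# Hypothetical output file and label, for illustration only.
out   <- tempfile(fileext = ".pdf")
label <- "\u0141\u00f3dzkie"   # a region name, as ASCII-safe \u escapes

# Tell the device to use the ISOLatin2 encoding file shipped with R.
pdf(out, encoding = "ISOLatin2.enc")
plot.new()
text(0.5, 0.5, label)
invisible(dev.off())           # close the device so the file is written
```

postscript() takes the same encoding argument.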

The use of Unicode was supposed to reduce the impact of Babel. But implementation split into two camps (Windows with UCS-2 and Unix-alikes with UTF-8) and some important players (e.g. Adobe) have ignored it, so it has only been a very partial solution.

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu Jan 25 20:22:53 2007

