Re: [R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)

From: Hans-Jörg Bibiko <bibiko_at_eva.mpg.de>
Date: Sun, 01 Jun 2008 21:50:59 +0200

On 31.05.2008, at 00:11, Prof Brian Ripley wrote:

> On Fri, 30 May 2008, Duncan Murdoch wrote:

>> But I think with Brian Ripley's work over the last while, R for  
>> Windows actually handles utf-8 pretty well.  (It might not guess  
>> at that encoding, but if you tell it that's what you're using...)

Yes. As I already mentioned, UTF-8 support on Windows took a big step forward from R 2.6 to R 2.7.

> R passes around, prints and plots UTF-8 character data pretty well,
> but it translates to the native encoding for almost all character-
> level manipulations (and not just on Windows). ?Encoding spells
> out the exceptions (and I think the original poster had not read
> it). As time goes on we may add more, but it is really tedious
> (and somewhat error-prone) to have multiple paths through the code
> for different encodings (and different OSes do handle these
> differently -- Windows' use of UTF-16 means that one character may
> not be one wchar_t).
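
The point about UTF-16 can be illustrated with a small sketch. It is written in Python rather than R, purely to show the encoding arithmetic; the helper name utf16_units and the two sample characters are assumptions chosen for illustration and do not come from the thread.

```python
# A character outside the Basic Multilingual Plane (BMP) needs two
# 16-bit code units (a surrogate pair) in UTF-16, so "one character"
# is not always "one 16-bit wchar_t".
import struct

def utf16_units(ch):
    """Return the list of 16-bit code units that encode ch in UTF-16."""
    raw = ch.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

bmp = "\u00e4"         # LATIN SMALL LETTER A WITH DIAERESIS, inside the BMP
astral = "\U00010330"  # GOTHIC LETTER AHSA, outside the BMP

print([hex(u) for u in utf16_units(bmp)])     # ['0xe4']
print([hex(u) for u in utf16_units(astral)])  # ['0xd800', '0xdf30']
```

Both samples are a single character to the user, but the second one occupies two code units, which is what makes per-character indexing on a 16-bit wchar_t error-prone.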

R is becoming more and more popular among philologists, linguists, etc. It is very nice to have one software environment to gather, analyze, and visualize text-based data. But linguists, for example, very often deal with more than one language at the same time, which is why they have to use a Unicode encoding. In R they need the functions that operate on characters, such as nchar, strsplit, grep/gsub, tolower/toupper, etc. These functions depend, more or less, on the underlying locale settings. But why?

It is a very painful task to write functions for different encodings on different platforms. Thus I wonder whether it would be possible to switch internally to a single Unicode encoding. Considering memory usage, for example, UTF-8 would be an option. Of course, such a change would be a REALLY BIG challenge in terms of effort, speed, compatibility, etc. It would also mean avoiding the use of system libraries.
Maybe this would be a task for R 4.0, or it will remain my eternal private dream :)
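
The memory argument for UTF-8 can be made concrete with a rough sketch. It is in Python, not R, and only counts encoded bytes; the function name encoded_sizes and the sample words are assumptions made for illustration. Note that the result depends on the script: UTF-8 is compact for Latin-based text but not for every language.

```python
# Compare how many bytes the same text needs in three Unicode encodings.
def encoded_sizes(text):
    """Map each encoding name to the number of bytes text occupies in it."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

samples = ["language", "Größe", "言語"]  # English, German, Japanese

for word in samples:
    print(word, encoded_sizes(word))
# language -> utf-8: 8,  utf-16-le: 16, utf-32-le: 32
# Größe    -> utf-8: 7,  utf-16-le: 10, utf-32-le: 20
# 言語     -> utf-8: 6,  utf-16-le: 4,  utf-32-le: 8
```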

OK. Let me be a bit more realistic.
Another issue is the regular expression engine in use. On a Mac or UNIX machine one can set a UTF-8 locale. Fine. But such locales aren't available under Windows (yet?). Maybe it's worth having a look at other regexp engines like Oniguruma ( http://www.geocities.jp/kosako3/oniguruma/ ). It supports, among others, all Unicode encodings, and it is used in many applications. I do not know how difficult it would be to integrate such a library into R, but I guess it would solve 80% of the problems of R users who are interested in text analysis. nchar, strsplit, grep, etc. could make use of it. Maybe one could write such a package for Windows (maybe also for Mac/UNIX, because Oniguruma has some very nice additional features). Of course, a string would have to be passed to the Oniguruma library as a UTF-8 byte stream, and I do not know whether this is easily possible in R for Windows.
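
Why the regexp engine has to know the encoding can be shown with a minimal sketch. It uses Python's built-in re module, not Oniguruma and not R's own engine, so it only illustrates the general problem; the function names and the sample sentence are assumptions made for illustration.

```python
# The same pattern, \w+, applied once to decoded text and once to the
# raw UTF-8 bytes. A byte-oriented match without encoding knowledge
# breaks words at every non-ASCII character.
import re

def tokens_as_text(s):
    """Tokenize with a Unicode-aware pattern on the decoded string."""
    return re.findall(r"\w+", s)

def tokens_as_bytes(s):
    """Tokenize the UTF-8 byte stream with a byte pattern (ASCII-only \\w)."""
    return re.findall(rb"\w+", s.encode("utf-8"))

sentence = "Größe und Maß"
print(tokens_as_text(sentence))   # ['Größe', 'und', 'Maß']
print(tokens_as_bytes(sentence))  # [b'Gr', b'e', b'und', b'Ma']
```

An engine that is told the input is UTF-8 can return the first result directly from the byte stream, without depending on the locale of the operating system.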

Once again, thanks for all the effort put into creating such a wonderful piece of software.

--Hans



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 02 Jun 2008 - 01:09:09 GMT

