Re: [Rd] locales and readLines

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon, 03 Sep 2007 10:40:39 +0100 (BST)

I think you need to delimit a bit more what you want to do. It is difficult in general to tell what encoding a text file is in, and very much harder if this is a data file containing only a small proportion of non-ASCII text, which might not even be words in a human language (but abbreviations or acronyms).

If you have experience with systems that do try to guess (e.g. Unix 'file') you will know that they are pretty fallible. There are Perl modules available; for example, I checked Encode::Guess, whose documentation says:

    Because of the algorithm used, ISO-8859 series and other single-byte
    encodings do not work well unless either one of ISO-8859 is the only
    one suspect (besides ascii and utf8).

    Do not mix national standard encodings and the corresponding vendor
    encodings.

    It is, after all, just a guess. You should always be explicit when it
    comes to encodings. But there are some, especially Japanese,
    environments where guess-coding is a must. Use this module with care.

I think you may have missed that the main way to specify an encoding for a file is

readLines(file("fn", encoding="latin2"))

and not the 'encoding' argument to readLines(), although the help page is quite clear that the latter does not re-encode: it only marks strings as being in "UTF-8" or "latin1".
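The difference between the two arguments can be seen with a small self-contained example (the file is created here just for illustration):

```r
## Sketch: the two 'encoding' arguments behave differently.
## Write a small latin1-encoded file ("café", with 0xE9 for é):
tf <- tempfile()
writeBin(c(charToRaw("caf"), as.raw(0xe9), charToRaw("\n")), tf)

## The connection's 'encoding' argument re-encodes on input, so the
## result is correct text in (say) a UTF-8 locale:
readLines(file(tf, encoding = "latin1"))

## readLines' own 'encoding' argument does NOT re-encode; it merely
## declares how the (unconverted) bytes should be interpreted:
readLines(tf, encoding = "latin1")
```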

The author of a package that offers facilities to read non-ASCII text does need to give the user a way to specify the encoding. I think calling that 'an extra burden' is exceedingly negative: you could rather be thankful that R these days provides the facilities to do so. And if the package or its examples contain non-ASCII character strings, it is de rigueur for the author to consider how they might work on other people's systems.

Notice that source() already has some of the 'smarts' you are asking about when 'file' is an actual file and not a connection, and you could provide a similar wrapper for readLines. That is useful either when the user can specify a small set of possible encodings or when such a set can be deduced from the locale. If the concern is that a file might be UTF-8 or latin1, this is often a good guess (latin1 files can be valid UTF-8, but rarely are). However, if you have Russian text which might be in one of several 8-bit encodings, the only way I know to decide which is to see whether the results make sense (and if they are acronyms, they may make sense in all the candidate encodings).
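A wrapper along those lines might look like the following sketch (the function name and the default candidate set are hypothetical, not anything in R):

```r
## Hypothetical wrapper: try each candidate encoding in turn and keep
## the first one under which every line converts cleanly.  iconv()
## returns NA for input that is invalid in the 'from' encoding.
## NB: every byte sequence is valid latin1, so latin1 must come last.
readLinesGuess <- function(fn, encodings = c("UTF-8", "latin1")) {
  raw_lines <- readLines(fn)                 # bytes as-is, no re-encoding
  for (enc in encodings) {
    converted <- iconv(raw_lines, from = enc, to = "UTF-8")
    if (!anyNA(converted)) return(converted)
  }
  stop("no candidate encoding converted the file cleanly")
}
```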

BTW, to guess an encoding you really need to process all of the input, so this is not appropriate for general connections, and for large files it might be better to do it externally to R, e.g. in Perl.
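For a file that does fit in memory, a whole-file check can still be done within R; here is an illustrative helper (is_valid_utf8_file is not an R function) that tests whether an entire file is valid UTF-8:

```r
## Illustrative helper: read the whole file as raw bytes and attempt a
## single whole-file conversion; iconv() yields NA if any part of the
## input is invalid UTF-8.  (rawToChar() would fail on embedded nuls.)
is_valid_utf8_file <- function(fn) {
  bytes <- readBin(fn, what = "raw", n = file.info(fn)$size)
  txt <- rawToChar(bytes)
  !is.na(iconv(txt, from = "UTF-8", to = "UTF-8"))
}
```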

I would say minimal good practice would be to let the user specify the encoding explicitly, and I'd leave guessing to others: as
http://www.cs.tut.fi/~jkorpela/chars.html says,

   It is hopefully obvious from the preceding discussion that a sequence
   of octets can be interpreted in a multitude of ways when processed as
   character data. By looking at the octet sequence only, you cannot even
   know whether each octet presents one character or just part of a
   two-octet presentation of a character, or something more complicated.
   Sometimes one can guess the encoding, but data processing and transfer
   shouldn't be guesswork.

On Fri, 31 Aug 2007, Martin Morgan wrote:

> R-developers,
>
> I'm looking for some 'best practices', or perhaps an upstream solution
> (I have a deja vu about this, so sorry if it's already been asked).
> Problems occur when a file is encoded as latin1, but the user has a
> UTF-8 locale (or I guess more generally when the input locale does not
> match R's). Here are two examples from the Bioconductor help list:
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-August/018947.html
>
> (the relevant command is library(GEOquery); gse <- getGEO('GSE94'))
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-July/018204.html
>
> I think solutions are:
>
> * Specify the encoding in readLines.
>
> * Convert the input using iconv.
>
> * Tell the user to set their locale to match the input file (!)
>
> Unfortunately, these (1 & 2, anyway) place extra burden on the package
> author, to become educated about locales, the encoding conventions of
> the files they read, and to know how R deals with encodings.
>
> Are there other / better solutions? Any chance for some (additional)
> 'smarts' when reading files?
>
> Martin
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Mon 03 Sep 2007 - 09:48:48 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 03 Sep 2007 - 19:40:22 GMT.
