[Rd] encoding issues even w/o accents

From: Ross Boylan <ross_at_biostat.ucsf.edu>
Date: Thu 18 Jan 2007 - 07:56:15 GMT

An earlier thread (in 10/2006) discussed encoding issues in the context of R data and the desire to represent accented characters.

It matters in another setting: the output generated by R and the seemingly ordinary character "'" (single quote). In particular, R CMD check runs test code and compares the generated output to a saved file of expected output. This does not work reliably across encoding schemes. This is unfortunate, since it seems the "expected output" files will necessarily be wrong for someone.

The problem for me was triggered by the single-quote character "'". On my older systems, this is encoded by 0x27, a perfectly fine ASCII character. That is on a Debian GNU/Linux system with LANG=en_US. On a newer system I have LANG=en_US.UTF-8. I don't recall whether this was a deliberate choice on my part, or simply reflects changing defaults for the installer. (Note the earlier thread referred to the Debian-derived Ubuntu systems as having switched to UTF-8). Under UTF-8 the same character is encoded in the 3-byte sequence 0xE28098 (which seems odd; I thought the point of UTF-8 was that ASCII was a legitimate subset).
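The two byte values mentioned above can be inspected directly. The following is a minimal sketch in Python (chosen only for illustration; it is not part of R or of R CMD check). It suggests that 0xE28098 is not an alternate encoding of the ASCII single quote but the UTF-8 encoding of a different character, the typographic left single quotation mark, while the ASCII quote itself still encodes as the single byte 0x27 under UTF-8.

```python
import unicodedata

# The ASCII single quote is one byte in UTF-8, the same as in plain ASCII.
ascii_quote = "'"
print(ascii_quote.encode("utf-8").hex())  # 27

# The 3-byte sequence from the post decodes to a different code point.
seq = bytes.fromhex("e28098")
decoded = seq.decode("utf-8")
print(f"U+{ord(decoded):04X}", unicodedata.name(decoded))
# U+2018 LEFT SINGLE QUOTATION MARK
```

If that reading is right, ASCII remains a legitimate subset of UTF-8, and the difference in output comes from a different quote character being printed in the UTF-8 locale rather than from a different encoding of the same one.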

The coefficient printing methods in the stats package use the single quote in the key explaining significance levels:

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I suppose one possible work-around for R CMD check would be to set the encoding to some standard value before it runs tests, but that has some drawbacks. It doesn't work for packages needing a different encoding (but perhaps the package could specify an encoding to use by default?) (*). It will leave the output files looking weird on systems with a different encoding. And it will get messed up if one generates the files under the wrong encoding.
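To make the comparison failure concrete, here is a minimal sketch in Python. It illustrates a different technique from the one suggested above: instead of fixing the encoding before the tests run, it maps typographic quotes to their ASCII counterparts before comparing. The names `QUOTE_MAP`, `normalize_quotes`, and `outputs_match` are hypothetical, and nothing here reflects what R CMD check actually does.

```python
# Hypothetical helper names; an illustration only, not R CMD check behaviour.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    """Replace typographic quotes with their ASCII equivalents."""
    return text.translate(QUOTE_MAP)

def outputs_match(actual: str, expected: str) -> bool:
    """Compare two outputs while ignoring the quote style."""
    return normalize_quotes(actual) == normalize_quotes(expected)

# The significance key as it might appear with ASCII and typographic quotes.
ascii_line = "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
utf8_line = (
    "Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 "
    "\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1"
)

print(ascii_line == utf8_line)               # False: a plain comparison fails
print(outputs_match(ascii_line, utf8_line))  # True once quotes are normalized
```

This only covers quote characters, so it would not help a package whose output legitimately contains other non-ASCII text, and it would hide a real difference if the quote style itself were what a test meant to check.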

And none of this addresses stuff beyond the context of output file comparison in R CMD check.

Any thoughts?

Ross Boylan

(*) I would not expect that the test output files be considered "documentation," but I suppose that's subject to interpretation.

