Re: [Rd] encoding issues even w/o accents (background on single quotes)

From: Ross Boylan <ross_at_biostat.ucsf.edu>
Date: Fri 19 Jan 2007 - 19:39:49 GMT

On Wed, Jan 17, 2007 at 11:56:15PM -0800, Ross Boylan wrote:
> An earlier thread (in 10/2006) discussed encoding issues in the
> context of R data and the desire to represent accented characters.
>
> It matters in another setting: the output generated by R and the
> seemingly ordinary character "'" (single quote). In particular, R CMD
> check runs test code and compares the generated output to a saved file
> of expected output. This does not work reliably across encoding
> schemes. This is unfortunate, since it seems the "expected output"
> files will necessarily be wrong for someone.
>
> The problem for me was triggered by the single-quote character "'".
> On my older systems, this is encoded by 0x27, a perfectly fine ASCII
> character. That is on a Debian GNU/Linux system with LANG=en_US. On
> a newer system I have LANG=en_US.UTF-8. I don't recall whether
> this was a deliberate choice on my part, or simply reflects changing
> defaults for the installer. (Note the earlier thread referred to the
> Debian-derived Ubuntu systems as having switched to UTF-8). Under
> UTF-8 the same character is encoded in the 3-byte sequence 0xE28098
> (which seems odd; I thought the point of UTF-8 was that ASCII was a
> legitimate subset).

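Apparently quoting, particularly with single quotes, is a can of worms:
http://www.cl.cam.ac.uk/~mgk25/ucs/quotes.html
When Unicode is available (which is the case with UTF-8), particular non-ASCII characters are recommended for single quoting. The 3-byte sequence 0xE2 0x80 0x98 is the UTF-8 encoding of U+2018, the recommended left single quotation mark. So ASCII is still a legitimate subset of UTF-8: the apostrophe itself remains 0x27, and what apparently changed is the character being output, not the encoding of the apostrophe.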

See http://en.wikipedia.org/wiki/UTF-8 on UTF-8 encoding.
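
To make the byte counts concrete, here is a small sketch in Python (not R; the helper name utf8_hex is made up for this illustration). It prints the UTF-8 bytes of the ASCII apostrophe and of the left single quotation mark:

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


ascii_apostrophe = "'"        # U+0027, plain ASCII
left_single_quote = "\u2018"  # U+2018, LEFT SINGLE QUOTATION MARK

for ch in (ascii_apostrophe, left_single_quote):
    # Show the code point next to its UTF-8 byte sequence.
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
```

The apostrophe stays a single byte (27), while U+2018 takes three (E2 80 98), which matches the sequence I saw in the output file.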

This is more than I or, probably, you ever wanted to know about this issue!

Ross

>
> The coefficient printing methods in the stats package use the
> single-quote in the key explaining significance levels:
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> I suppose one possible work-around for R CMD check would be to set the
> encoding to some standard value before it runs tests, but that has
> some drawbacks. It doesn't work for packages needing a different
> encoding (but perhaps the package could specify an encoding to use by
> default?)(*). It will leave the output files looking weird on systems
> with a different encoding. It will get messed up if one generates the
> files under the wrong encoding.
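
A different angle on the same work-around, as a rough sketch only: rather than forcing an encoding, one could fold the Unicode quotation marks back to ASCII before comparing generated output with the saved file. This is written in Python, is not anything R CMD check actually does, and the names normalize_quotes and outputs_match are hypothetical. It also assumes both texts have already been decoded to strings correctly.

```python
# Translation table from Unicode "fancy" quotes to ASCII stand-ins.
FANCY_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace Unicode quotation marks with their ASCII equivalents."""
    return text.translate(FANCY_TO_ASCII)


def outputs_match(expected: str, actual: str) -> bool:
    """Compare two outputs while ignoring fancy-versus-ASCII quote differences."""
    return normalize_quotes(expected) == normalize_quotes(actual)


# Saved under LANG=en_US (ASCII quotes) versus generated under UTF-8.
saved = "Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
generated = ("Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01 "
             "\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1")

print(saved == generated)               # plain comparison fails
print(outputs_match(saved, generated))  # comparison after folding quotes succeeds
```

This only papers over the quote characters. It would not help a package whose output legitimately contains other non-ASCII text.
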
>
> And none of this addresses stuff beyond the context of output file
> comparison in R CMD check.
>
> Any thoughts?
>
> Ross Boylan
>
>
> * From the R Extensions document, discussing the DESCRIPTION file:
> If the `DESCRIPTION' file is not entirely in ASCII it should contain
> an `Encoding' field specifying an encoding. This is currently used as
> the encoding of the `DESCRIPTION' file itself, and may in the future be
> taken as the encoding for other documentation in the package. Only
> encoding names `latin1', `latin2' and `UTF-8' are known to be portable.
>
> I would not expect that the test output files be considered
> "documentation," but I suppose that's subject to interpretation.



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat Jan 20 06:43:43 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 19 Jan 2007 - 20:31:16 GMT.
