[R] issue with "strange" characters (locale settings)

From: R.T.A.J.Leenders <r.t.a.j.leenders_at_rug.nl>
Date: Wed, 04 May 2011 11:57:46 +0200

   WinXP-x32, R-2.13.0
   Dear list,
   I have a problem that (I think) relates to the interaction between Windows and R.
   I am trying to scrape a table with data on the Hawai'ian Islands. This is my code:
   library(XML)
   u <- "http://en.wikipedia.org/wiki/Hawaii"
   tables <- readHTMLTable(u)
   Islands <- tables[[5]]
   The output is (first set of columns):

> Islands
          Island            Nickname                                                   Location
1     Hawaiʻi[7]      The Big Island      19°34⤲N 155°30⤲W / 19.567°N 155.5°W / 19.567; -155.5
2        Maui[8]     The Valley Isle      20°48⤲N 156°20⤲W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9]     The Target Isle        20°33⤲N 156°36⤲W / 20.55°N 156.6°W / 20.55; -156.6
4   LÄnaÊ»i[10]  The Pineapple Isle  20°50⤲N 156°56⤲W / 20.833°N 156.933°W / 20.833; -156.933
5  Molokaʻi[11]   The Friendly Isle  21°08⤲N 157°02⤲W / 21.133°N 157.033°W / 21.133; -157.033
6     Oʻahu[12] The Gathering Place  21°28⤲N 157°59⤲W / 21.467°N 157.983°W / 21.467; -157.983
7    Kauaʻi[13]     The Garden Isle      22°05⤲N 159°30⤲W / 22.083°N 159.5°W / 22.083; -159.5
8   NiÊ»ihau[14]  The Forbidden Isle      21°54⤲N 160°10⤲W / 21.9°N 160.167°W / 21.9; -160.167

   As you can see, there are "weird" characters in there. I have also tried readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding = "UTF-8"), but that didn't help.
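   A minimal sketch of how this pattern can arise, under the assumption (not confirmed here) that the strings hold UTF-8 bytes which are then interpreted one byte at a time in a single-byte encoding. The helper names as_misread and undo_misread are hypothetical and not part of the XML package; "latin1" stands in for the Windows-1252 code page so that the sketch runs on any platform.

```r
# Hypothetical helper: reinterpret the UTF-8 bytes of a single string as latin1.
# This mimics the kind of garbling seen in the island names above.
as_misread <- function(x) {
  bytes <- charToRaw(enc2utf8(x))
  iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")
}

# Hypothetical helper: reverse the step above by writing the characters back
# out as single bytes and then declaring those bytes to be UTF-8.
undo_misread <- function(x) {
  out <- iconv(x, from = "UTF-8", to = "latin1")
  Encoding(out) <- "UTF-8"
  out
}

garbled <- as_misread("L\u0101na\u02bbi")   # the name with macron and okina
print(garbled)
print(undo_misread(garbled))
```

   If the assumption holds, checking Encoding() on the scraped strings should show how R has them marked; if it does not hold, this sketch does not apply.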
   It seems to me that there may be an issue with how the Windows character-set settings interact with R. sessionInfo() gives:
> sessionInfo()
   R version 2.13.0 (2011-04-13)
   Platform: i386-pc-mingw32/i386 (32-bit)

   locale:
   [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252    LC_MONETARY=Dutch_Netherlands.1252
   [4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252

   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base

   other attached packages:
   [1] XML_3.2-0.2
>

   I have also attempted to let R use another setting by entering Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
> Sys.setlocale("LC_ALL", "en_US.UTF-8")
   [1] ""
   Warning message:
   In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
     OS reports request to set locale to "en_US.UTF-8" cannot be honored
>
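   For reference, a small sketch that probes which locale names the OS accepts instead of guessing one at a time. The function name try_locales is hypothetical. The candidate "English_United States.1252" follows the Windows naming style visible in the sessionInfo() output above, while "en_US.UTF-8" is the POSIX-style name that was rejected; which names succeed depends on the platform.

```r
# Hypothetical helper: try each candidate name for one locale category and
# return the ones the OS accepts. The original setting is restored on exit.
try_locales <- function(candidates, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  on.exit(Sys.setlocale(category, old), add = TRUE)
  accepted <- vapply(candidates, function(loc) {
    # Sys.setlocale() returns "" (with a warning) when the request fails
    res <- suppressWarnings(Sys.setlocale(category, loc))
    nzchar(res)
  }, logical(1))
  candidates[accepted]
}

ok <- try_locales(c("English_United States.1252", "en_US.UTF-8", "C"))
print(ok)
```

   The "C" locale is included only as a name that should be accepted everywhere, so the result is never empty.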

   In addition, I have attempted to make the change directly from the Windows command prompt, using "chcp 65001" and variations of that, but that didn't change anything.
   I have searched the list and the web and have found others raising similar issues, but have not been able to find a solution. It looks like this is an issue of how Windows and R interact. Unfortunately, all three computers at my disposal have this problem. It occurs both under WinXP-x32 and under Win7-x86.
   Is there a way to make R override the Windows settings, or can the issue be solved otherwise?
   I have also tried other websites, and the issue occurs every time there is an é, ü, ä, î, et cetera in the text to be scraped.
   Thank you,
   Roger

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 05 May 2011 - 06:25:03 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 25 May 2011 - 23:50:09 GMT.
