Re: [R] encodings (was Reading .csv file under linux)

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Wed, 23 Jan 2008 14:07:51 +0000 (GMT)

As a real-life example, package fdim is marked as being in latin2. However, ?fdim looks more plausible to me in latin1. I'm more confident about this because the original authors are Spanish, a language that is never (?) written in latin2, and I checked with a Spanish reader.
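
To illustrate why only a reader can settle this, here is a minimal sketch. It does not use the actual fdim help text; the three bytes are an illustrative stand-in that happens to spell a Spanish word under latin1. The same bytes convert cleanly under both encodings, so nothing in the file itself says which was meant.

```r
# The same three bytes, decoded under two different ISO-8859 encodings.
# Byte 0xF1 is n-with-tilde in ISO-8859-1 but n-with-acute in ISO-8859-2.
bytes <- as.raw(c(0x61, 0xF1, 0x6F))   # "a", 0xF1, "o" (illustrative only)
x <- rawToChar(bytes)

as_latin1 <- iconv(x, from = "ISO-8859-1", to = "UTF-8")
as_latin2 <- iconv(x, from = "ISO-8859-2", to = "UTF-8")

# Neither conversion returns NA, so both readings are technically valid;
# only someone who knows the language can say which one is plausible.
print(c(latin1 = as_latin1, latin2 = as_latin2))
print(utf8ToInt(as_latin1))   # Unicode code points under the latin1 reading
print(utf8ToInt(as_latin2))   # Unicode code points under the latin2 reading
```

The two results differ only in the middle code point, which is exactly the kind of difference a Spanish reader spots at once and a program cannot.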

On Tue, 22 Jan 2008, Prof Brian Ripley wrote:

> On Wed, 23 Jan 2008, David Scott wrote:
>
>> On Tue, 22 Jan 2008, Prof Brian Ripley wrote:
>>
>>> On Wed, 23 Jan 2008, David Scott wrote:
>>>
>>>>
>>>> I have encountered a problem with reading a .csv file on a linux box. I
>>>> can read the file on my windows machine (under XP) but on the linux box
>>>> it gives:
>>>>
>>>>> patients <- read.csv("../Patients.csv", header = FALSE,
>>>> + col.names = patientsNames)
>>>> Error in type.convert(data[[i]], as.is = as.is[i], dec = dec,
>>>> na.strings = character(0)) :
>>>> invalid multibyte string
>>>> Calls: read.csv -> read.table -> type.convert
>>>> Execution halted
>>>>
>>>> I am running R 2.6.1 on both machines. I tried on another linux box
>>>> running 2.5.1 and got the same problem.
>>>>
>>>> I am guessing it is something to do with the character encoding. On the
>>>> linux box I have
>>>>
>>>> LANG=en_US.UTF-8
>>>
>>> So what encoding is the .csv file in? Consider the example at the end of
>>> ?file
>>>
>>> ## examples of use of encodings
>>> cat(x, file = file("foo", "w", encoding="UTF-8"))
>>> # read a 'Windows Unicode' file including names
>>> A <- read.table(file("students", encoding="UCS-2LE"))
>>>
>>> and adapt accordingly (encoding = "CP1252" is the most likely value if
>>> this works in English-language Windows).
>>>
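
A minimal sketch of that adaptation, assuming the file really is CP1252 and that the offending character is a plus/minus sign (as turned out to be the case later in this thread). The temporary file, its contents and the column names are invented for illustration and stand in for Patients.csv and patientsNames; the result is only expected to look right in a UTF-8 session such as the LANG=en_US.UTF-8 one described above.

```r
# Build a small CSV from raw bytes so the example does not depend on the
# encoding of this script. 0xB1 is the plus/minus sign in CP1252, and a
# lone 0xB1 is not valid UTF-8.
csv_path <- tempfile(fileext = ".csv")   # hypothetical stand-in for Patients.csv
bytes <- c(charToRaw("1,5.2 "), as.raw(0xB1), charToRaw(" 0.3\n"),
           charToRaw("2,4.8 "), as.raw(0xB1), charToRaw(" 0.1\n"))
writeBin(bytes, csv_path)

# Declare the file's encoding on the connection; R then re-encodes the
# input to the session's native encoding while reading.
patients <- read.csv(file(csv_path, encoding = "CP1252"), header = FALSE,
                     col.names = c("id", "result"),   # illustrative names
                     stringsAsFactors = FALSE)

print(patients)
```

Passing the bare file name instead of the file() connection leaves the 0xB1 bytes untranslated, which is the situation that produced the 'invalid multibyte string' error in the original report.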
>>
>>
>> Thanks Brian for the super-quick, super-helpful reply. The encoding you
>> suggested worked.
>>
>> I found a workaround myself too---I guessed that some plus/minus signs
>> might be the problem and replaced them and could read in the file.
>> That is just a kludge so I am using the encoding specification.
>>
>> I am a total dunce when it comes to encodings though. How do you find the
>> encoding of a file?
>
> You ask the person who gave it to you. You can't in general tell, and e.g.
> ISO-8859-1 and ISO-8859-2 are only distinguishable by someone who can read
> the contents (if it is a human language). If you have just the odd symbol
> (e.g. degree sign or plus/minus) you can be completely stuck.
>
> 'file' on Linux can usually guess if a file is UTF-8 or ISO-8859-?, but not
> of course what ? is. But guesses are based on statistical patterns and are
> good for text but not so good for data.
>
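
A crude check in the same spirit can be done from within R. This is only a sketch, not what 'file' actually does: guess_encoding() is a hypothetical helper written for this example, and validUTF8() needs a reasonably recent R (3.3.0 or later). It can say "pure ASCII", "valid UTF-8" or "something else", and for the last case it cannot tell you which 8-bit encoding you have.

```r
# Hypothetical helper: classify a file's bytes very roughly.
guess_encoding <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  ints <- as.integer(bytes)
  # rawToChar() cannot handle embedded NULs, so deal with them first.
  if (any(ints == 0L)) return("contains NUL bytes (UTF-16/UCS-2 or binary?)")
  if (all(ints < 128L)) return("ASCII")
  if (validUTF8(rawToChar(bytes))) return("UTF-8")
  "not UTF-8: some 8-bit encoding, ask whoever supplied the file"
}

# Three tiny test files written as raw bytes (contents are illustrative).
f_ascii <- tempfile(); writeBin(charToRaw("abc\n"), f_ascii)
f_utf8  <- tempfile(); writeBin(as.raw(c(0x61, 0xC3, 0xB1, 0x0A)), f_utf8)
f_8bit  <- tempfile(); writeBin(as.raw(c(0x61, 0xF1, 0x6F, 0x0A)), f_8bit)

print(guess_encoding(f_ascii))
print(guess_encoding(f_utf8))
print(guess_encoding(f_8bit))
```

Note that the third file gets the same answer whether it was written as ISO-8859-1, ISO-8859-2 or CP1252, which is the point made above.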
> --
> Brian D. Ripley, ripley_at_stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 23 Jan 2008 - 14:12:16 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Jan 2008 - 14:30:08 GMT.
