Re: [R] Umlaut read from csv-file

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Sun, 09 Nov 2008 05:25:09 +0000 (GMT)

On Sat, 8 Nov 2008, Heinz Tuechler wrote:

> At 08:01 08.11.2008, Prof Brian Ripley wrote:
>> We have no idea what you understood (you didn't tell us), but the help says
>>
>> encoding: character vector. The encoding(s) to be assumed when 'file'
>> is a character string: see 'file'. A possible value is
>> '"unknown"': see the 'Details'.
>>
>> ...
>> This paragraph applies if 'file' is a filename (rather than a
>> connection). If 'encoding = "unknown"', an attempt is made to
>> guess the encoding. The result of 'localeToCharset()' is used as
>> a guide. If 'encoding' has two or more elements, they are tried
>> in turn until the file/URL can be read without error in the trial
>> encoding.
>>
>> So source(encoding="latin1") says the file is encoded in Latin-1 and should
>> be re-encoded if necessary (e.g. in UTF-8 locale).
>>
>> Setting the Encoding of parsed character strings is not mentioned.
>>
>> You could have written out a data frame with write.csv() and re-read it
>> with read.csv(encoding = "latin1"): that was the workaround you were given
>> earlier (not to use source).
>
> Thank you for this explanation. I felt that I did not understand the help
> page of source() and I hoped, encoding='latin1' would have the same effect as
> in read.csv(), but rethinking it, I see that it would conflict with the
> primary functionality of source().
> Earlier I tried writing the data.frame with write.csv and re-reading it. This
> works, but additional information like labels() I have to transfer in a
> second step.
> The best way I could imagine would be some function which marks every
> character string in the whole structure of a data.frame, including all
> attributes, as latin1.
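
Such a function is not hard to sketch. What follows is only a minimal sketch: mark_latin1() is a hypothetical name, not an existing R function, and it assumes that every unmarked non-ASCII string in the object really is Latin-1. It walks the object and its attributes and uses Encoding<- to declare the encoding; no bytes are converted.

```r
# Hypothetical helper (not part of R): mark every unmarked character string
# in an object, and in all of its attributes, as Latin-1.
mark_latin1 <- function(x) {
  # Attributes first: names, levels, labels and so on may hold strings too.
  for (nm in names(attributes(x))) {
    a <- attr(x, nm, exact = TRUE)
    if (is.character(a) || is.list(a)) attr(x, nm) <- mark_latin1(a)
  }
  if (is.character(x)) {
    # Only touch strings with no declared encoding; pure ASCII strings
    # stay "unknown" because Encoding<- never marks them.
    unk <- Encoding(x) == "unknown"
    if (any(unk)) Encoding(x)[unk] <- "latin1"
  } else if (is.list(x)) {
    # Data frames and other lists: recurse into every component.
    x[] <- lapply(x, mark_latin1)
  }
  x
}

# Example data built from raw Latin-1 bytes so it does not depend on the locale.
a_uml <- rawToChar(as.raw(0xe4))
d <- data.frame(v = c("a", a_uml), stringsAsFactors = FALSE)
attr(d$v, "label") <- rawToChar(as.raw(c(0x47, 0x72, 0xf6, 0xdf, 0x65)))

marked <- mark_latin1(d)
Encoding(marked$v)
Encoding(attr(marked$v, "label"))
```

Strings that already carry a mark (for example UTF-8) are left alone, so the helper should be safe to apply to objects of mixed origin.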

I think it is possible that

con <- file("foo")
source(con, encoding="latin1")
close(con)

will also do what you want, although that's an undocumented side effect.

But all of this should be unnecessary in R-patched. Other quirks with unmarked strings may still be lurking in the shadows, but there are no other obvious changes from 2.7.2.
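
For reference, the read.csv() workaround quoted earlier in this thread can be sketched as follows. This is an illustration under stated assumptions: the file contents and the names tmp, a_uml and d are made up for the example, and the file is written byte-for-byte so that it is Latin-1 whatever the current locale is.

```r
# Write a small CSV whose bytes are Latin-1, independent of the locale.
tmp <- tempfile(fileext = ".csv")
a_uml <- rawToChar(as.raw(0xe4))   # single Latin-1 byte, no encoding mark
writeLines(c("id,txt", "1,a", paste0("2,", a_uml)), tmp, useBytes = TRUE)

# 'encoding' declares how the strings in the file are encoded; the strings
# read in are marked accordingly rather than converted.
d <- read.csv(tmp, encoding = "latin1", stringsAsFactors = FALSE)
Encoding(d$txt)

unlink(tmp)
```

Only the data survive such a round trip; attributes such as labels still have to be transferred separately, as noted in the quoted message.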

>
>> On Sat, 8 Nov 2008, Heinz Tuechler wrote:
>>
>>> At 16:52 07.11.2008, Prof Brian Ripley wrote:
>>>> On Fri, 7 Nov 2008, Peter Dalgaard wrote:
>>>>
>>>>> Heinz Tuechler wrote:
>>>>>> Dear Prof.Ripley!
>>>>>> Thank you very much for your attention. In the given example
>>>>>> Encoding(),
>>>>>> or the encoding parameter of read.csv solve the problem. I hope your
>>>>>> patch will solve also the problem, when I read a spss file by
>>>>>> spss.get(), since this function has no encoding parameter and my real
>>>>>> problem originated there.
>>>>> read.spss() (package foreign) does have a reencode argument, though; and
>>>>> this is called by spss.get(), so it looks like an easy hack to add it
>>>>> there.
>>>> Yes, older software like spss.get needs to get updated for the
>>>> internationalization age. Modifying it to have a ... argument passed to
>>>> read.spss would be a good idea (and future-proofing).
>>>> In cases like this it is likely that the SPSS file does contain its
>>>> encoding (although sometimes it does not and occasionally it is wrong),
>>>> so it is helpful to make use of the info if it is there. However, the
>>>> default is read.spss(reencode=NA) because the problems caused by assuming
>>>> that the info is correct when it is not are worse.
>>>
>>> The reason I tried the example below was to solve the encoding problem by
>>> dumping and then re-sourcing a data.frame with the encoding parameter set
>>> to latin1. As you can see, source(x, encoding='latin1') does not have the
>>> effect I expected. Unfortunately I have no idea what I misunderstood
>>> about the meaning of encoding='latin1'.
>>>
>>> Heinz Tüchler
>>>
>>>
>>> us <- c("a", "b", "c", "ä", "ö", "ü")
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "latin1" "latin1" "latin1"
>>> dump('us', 'us_dump.txt')
>>> rm(us)
>>> source('us_dump.txt', encoding='latin1')
>>> us
>>> [1] "a" "b" "c" "ä" "ö" "ü"
>>> Encoding(us)
>>> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
>>> unlink('us_dump.txt')
>>>
>>>
>>>
>>>
>>>> --
>>>> Brian D. Ripley, ripley_at_stats.ox.ac.uk
>>>> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>>>> University of Oxford, Tel: +44 1865 272861 (self)
>>>> 1 South Parks Road, +44 1865 272866 (PA)
>>>> Oxford OX1 3TG, UK Fax: +44 1865 272595
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
>
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Received on Sun 09 Nov 2008 - 05:31:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 09 Nov 2008 - 11:30:22 GMT.
