Re: [R] Russian language in R

From: Duncan Murdoch <murdoch.duncan_at_gmail.com>
Date: Mon, 16 May 2011 08:41:26 -0400

On 16/05/2011 8:33 AM, Lyolya wrote:
> Dear Duncan,
>
> Thank you very much for your reply!
>
> I have tried what you have suggested. R was definitely assuming a different
> text encoding, and after trying the l10n_info() command, I got the
> following:
>
> l10n_info()
> $MBCS
> [1] TRUE
>
> $`UTF-8`
> [1] TRUE
>
> $`Latin-1`
> [1] FALSE
>
> My data is a dataframe (stored both in .xls and .dbf files) that represents
> the secondary housing market for Moscow for a given period of time. The
> problem is that the factors are given by Russian strings (those like general
> condition of the dwelling and the material the house is built of), and R
> does not read them correctly. This makes the analysis really complicated.
>
> In order to read the file, I do the following:
>
> require(foreign)
> MSL_1010 <- read.dbf("MSL_1010.dbf") # I tried both as.is=TRUE and FALSE
>
> and then when it comes to strings it reads something like: \x96\x80\x8e.

I'm not familiar with Russian encodings. If you know which encoding the file uses, you may be able to use iconv() to convert it to UTF-8, which l10n_info() reports as the native encoding on your system. To simplify things, use

read.dbf( "MSL_1010.dbf", as.is = TRUE)

so that you don't have to worry about factors and factor names. Then try

iconv(x, from="KOI8-R", to="UTF-8")

where x is one of the character vectors with bad characters. If that doesn't work, try a different possible encoding (e.g. cp1251).
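
As a rough sketch of that trial-and-error step (not something tested against the actual file), the bytes quoted above can be run through a short list of candidate encodings at once. The names `bad`, `candidates`, `convert_columns` and the toy data frame are made up for illustration, and the sketch assumes your iconv build knows the name "CP866" (check iconvlist()). For what it is worth, under CP866 (the DOS Cyrillic code page) the bytes 96 80 8E map to the letters ЦАО, so that encoding seems worth including in the list.

```r
# Bytes quoted in the thread, built from raw values so that the encoding
# of this script itself does not matter.
bad <- rawToChar(as.raw(c(0x96, 0x80, 0x8e)))

# Assumed shortlist of Cyrillic encodings; iconvlist() shows what is available.
candidates <- c("KOI8-R", "CP1251", "CP866")

# Convert the same bytes under each candidate. iconv() gives NA when the
# bytes are not valid in the assumed source encoding.
decoded <- vapply(candidates,
                  function(enc) iconv(bad, from = enc, to = "UTF-8"),
                  character(1))
print(decoded)

# Hypothetical helper: once one candidate looks right, convert every
# character column of a data frame read with as.is = TRUE.
convert_columns <- function(df, from, to = "UTF-8") {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], iconv, from = from, to = to)
  df
}

# Toy stand-in for the result of read.dbf("MSL_1010.dbf", as.is = TRUE).
toy <- data.frame(okrug = bad, price = 1, stringsAsFactors = FALSE)
fixed <- convert_columns(toy, from = "CP866")
```

Only the candidate whose output reads as sensible Russian is the right one; the others will usually convert without error but produce nonsense, so the result has to be checked by eye.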

Duncan Murdoch

>
> On 14 May 2011 01:08, Duncan Murdoch<murdoch.duncan@gmail.com> wrote:
>
> > On 13/05/2011 4:57 PM, lyolya wrote:
> >
> >> Hello,
> >>
> >> I am experiencing a problem in reading a database in Russian. The problem
> >> appears when it comes to char variables. I have already tried changing the
> >> encoding, i.e.
> >>
> >> options(encoding="UTF-8")
> >>
> >> and
> >>
> >> options(encoding="KOI8-R")
> >>
> >> but every time there appears to be something unreadable in the data frame,
> >> like \x82\xa2\xae\xef etc.
> >>
> >> Could you please answer whether it is possible to operate with Russian
> >> strings in R, and, if yes, how to get to do that. Thank you, in advance.
> >>
> >
> > Yes, it is possible. You can test it using a text editor that supports
> > Russian. Just put
> >
> > x <- " some Russian text "
> >
> > into the file, then use source() to read the file. Two things are
> > likely outcomes:
> >
> > x will be defined to be a string holding Russian text, and it will display
> > properly.
> >
> > OR
> >
> > it will be defined to be a string with lots of escapes or mis-displayed
> > characters in it. In the latter case, the problem is that R is assuming a
> > different encoding than your text editor. The l10n_info() function will display
> > information about what R is expecting.
> >
> > If none of the above helps you to get your code working, then you'll have
> > to give details on exactly what you're doing to read the file, and exactly
> > what is in the file.
> >
> > Duncan Murdoch
> >
>
>
>



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 16 May 2011 - 12:44:14 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 16 May 2011 - 13:50:06 GMT.