[R] Unicode characters (R 2.7.0 on Windows XP SP3 and Hardy Heron)

From: Stefan Th. Gries <stgries_at_gmail.com>
Date: Fri, 30 May 2008 08:14:54 -0700


Hi all

Four questions regarding Unicode.

Three Windows questions. I am using

> R.version

platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          7.0
year           2008
month          04
day            22
svn rev        45424
language       R

version.string R version 2.7.0 (2008-04-22)

# I loaded the file
# <http://www.linguistics.ucsb.edu/faculty/stgries/teaching/russ_corp.txt>
# into R, and this works fine.

corpus.file<-scan(choose.files(), what="char", sep="\n", quote="", comment.char="", encoding="UTF-8")

# My problems are the following:
# 1 strsplit

# This does not work:
words.1<-unlist(strsplit(corpus.file, "[-!;:\'\"\\?\\. ]+", perl=T))

# - words.1[173] should be "фирме", as in corpus.file[6]
# but it is "фирме"
# - words.1[208] should be "Торговли", as in corpus.file[13]
# but it is "Торговли"
# - words.1[214] should be "клиентов", as in corpus.file[14]
# but it is "Торговли"
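
A minimal sketch for narrowing this down, assuming the problem may lie in how the words are displayed rather than in how they are split: look at the encoding mark and the code points of the pieces instead of their printed form. The object demo.line is a hypothetical stand-in for one corpus line, not data from russ_corp.txt.

```r
# Hypothetical stand-in for one line of corpus.file (not taken from russ_corp.txt)
demo.line <- "\u0434\u043b\u044f \u0444\u0438\u0440\u043c\u0435: \u043a\u043b\u0438\u0435\u043d\u0442\u043e\u0432."

# Same set of separators as above; inside [...] the characters ? and .
# are literal, so they need no escaping
demo.words <- unlist(strsplit(demo.line, "[-!;:'\"?. ]+", perl = TRUE))

# Encoding marks of the pieces ("UTF-8", "latin1", or "unknown")
Encoding(demo.words)

# Code points of the second piece; these do not depend on how the
# console chooses to print the string
utf8ToInt(demo.words[2])
```

If the code points are the expected Cyrillic ones (U+0444, U+0438, U+0440, U+043C, U+0435 for the second piece), the split itself worked and only the display differs.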

# 2 entering Unicode characters into R: I want to search for,
# say, "для". So I try to define it as follows,
# but this doesn't work:
(x123<-"\u0434\u043b\u044F")

# I can define each individual character
(x1<-"\u0434"); (x2<-"\u043b"); (x3<-"\u044F")

# and each pair of characters

(x12<-"\u0434\u043b")
(x13<-"\u0434\u044F")
(x23<-"\u043b\u044F")

# but not all three: the last one gets skipped.
# Why is that, and how do I do it?
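
One possible workaround, sketched under the assumption that the difficulty lies in parsing three consecutive \u escapes in one string literal: build the string from its code points with intToUtf8(), or paste the single-character strings together. The names cp, x123.alt and x123.pasted are introduced here only for illustration.

```r
# Code points of the three characters in question
cp <- c(0x0434, 0x043B, 0x044F)

# Build the three-character string without any \u escapes in a literal
x123.alt <- intToUtf8(cp)

# Alternative: concatenate the single characters, which can be defined one by one
x123.pasted <- paste("\u0434", "\u043b", "\u044F", sep = "")

# Both should contain three characters, not bytes
nchar(x123.alt, type = "chars")
nchar(x123.pasted, type = "chars")
```

Comparing utf8ToInt(x123.alt) with cp is a quick way to confirm that no character was dropped.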

# 3 defining Unicode character ranges: in each of the following,
# the last bracket does not get included (even if it gets defined
# as a Unicode character, too):

russ.char.yes<-"[\u0401\u0410-\u044F\u0451]" # all Russian Cyrillics
russ.char.no<-"[^\u0401\u0410-\u044F\u0451]" # other characters
russ.char.capit<-"[\u0401\u0410-\u042F]" # capital Russian Cyrillics (U+0401 is capital Ё)
russ.char.small<-"[\u0430-\u044F\u0451]" # small Russian Cyrillics (U+0451 is small ё)
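
A sketch of one way to sidestep the escape directly before the closing bracket, assuming (as with question 2) that the string literal is what goes wrong: assemble the class from code points with intToUtf8() and paste(). The names russ.class, test.items and hits are hypothetical, and perl=TRUE is a choice made for this sketch, not something the question requires.

```r
# Assemble the class from code points so that no \u escape sits next to "]"
russ.class <- paste("[", intToUtf8(c(0x0401, 0x0410)), "-",
                    intToUtf8(c(0x044F, 0x0451)), "]", sep = "")

# Hypothetical test items: four Cyrillic letters, one Latin letter, one digit
test.items <- c("\u0434", "\u0401", "\u0451", "\u044F", "d", "1")

# Indices of the items that contain a character from the class
hits <- grep(russ.class, test.items, perl = TRUE)
hits
```

The third test item is U+0451, the character that sits last in the bracket, so its index appearing in hits shows that the end of the class was kept.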

# I can do that all on Linux, but this arises in a context where
# many other character processing issues are explained for Mac,
# Linux, *and* Windows, and I'd hate to have to say "this one
# thing, you can't do on Windows"

One Linux question. I am using Ubuntu Hardy Heron:

> sessionInfo()
R version 2.7.0 (2008-04-22)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

# strange(?) behavior of word boundary characters:
# I understand why these work ...
grep("\\bмолод", "а молодость", perl=F, value=T) # OK
# [1] "а молодость"
gsub("\\bмолод", ">XX<", "а молодость", perl=F) # OK
# [1] "а >XX<ость"

# but why does "\\b" not work with perl=T?
grep("\\bмолод", "а молодость", perl=T, value=T) # FAIL
# character(0)

gsub("\\bмолод", ">XX<", "а молодость", perl=T) # FAIL
# [1] "а молодость"
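
A likely explanation is that PCRE by default treats only ASCII letters, digits and the underscore as word characters, so it sees no word boundary between a space and a Cyrillic letter. The sketch below assumes an R build whose PCRE library understands the (*UCP) pattern prefix (PCRE 8.10 or later); this is not verified for R 2.7.0. The names stem, phrase, ucp.pattern and ucp.result are introduced only for illustration.

```r
# Pattern and text written with \u escapes so the example does not depend
# on the encoding of the script file: "молод" and "а молодость"
stem <- "\u043c\u043e\u043b\u043e\u0434"
phrase <- "\u0430 \u043c\u043e\u043b\u043e\u0434\u043e\u0441\u0442\u044c"

# (*UCP) at the very start asks PCRE to use Unicode properties
# when deciding what \b, \w, \d and \s match
ucp.pattern <- paste("(*UCP)\\b", stem, sep = "")

# Replace the stem where it starts at a (Unicode-aware) word boundary
ucp.result <- gsub(ucp.pattern, ">XX<", phrase, perl = TRUE)
ucp.result
```

Where (*UCP) is not available, an explicit class such as "(^|[^\u0401\u0410-\u044F\u0451])" in front of the stem is another way to express the boundary, at the cost of also matching the preceding character.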

Any pointers would be much appreciated and acknowledged ... STG



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon 02 Jun 2008 - 02:38:16 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 02 Jun 2008 - 04:30:35 GMT.
