[R] Data bug that read.csv doesn't like

From: Randall Johnson [Contr] <rjohnson_at_ncifcrf.gov>
Date: Thu, 20 Dec 2007 14:24:01 -0500

I have a bug in my data that read.csv doesn't like, but _only_ when specifying "na.strings = 'missing'". If I delete the offending Chinese characters, the problem goes away as well. I'm satisfied that the problems with this data file are fixed, but is there anything I can do to avoid this in the future (other than avoiding Chinese characters)? Any ideas as to what is going on here? I've attached the piece of the data file I used for the example below.


> read.csv('../data/tmp.csv')

   Smoking_status Age_start_smoking         Pack_day
1              0
2              0
3        missing           missing          missing
4              1                18 \xc9\xd9\xc1\xbf
5              1                20                1

> read.csv('../data/tmp.csv', na.strings = 'missing')
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0)) :
   invalid multibyte string
> sessionInfo()

R version 2.6.1 (2007-11-26)


attached base packages:
[1] stats graphics grDevices utils datasets methods base
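
For anyone hitting the same error, here is a rough sketch of one way to find and work around bytes like these before read.csv sees them. It is not a diagnosis of why only the na.strings call fails. The temporary file is a hypothetical stand-in for tmp.csv, built from the bytes printed above, and the choice of GB2312 as the source encoding is an assumption (the bytes look like a legacy Chinese encoding). It also assumes an R version that has validUTF8() and the text= argument of read.csv, and an iconv build that knows GB2312.

```r
# Hypothetical stand-in for tmp.csv: same header, plus the raw bytes
# \xc9\xd9\xc1\xbf from the Pack_day column in the output above.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("Smoking_status,Age_start_smoking,Pack_day\n"),
           charToRaw("missing,missing,missing\n"),
           charToRaw("1,18,"), as.raw(c(0xc9, 0xd9, 0xc1, 0xbf)),
           charToRaw("\n"),
           charToRaw("1,20,1\n")),
         tmp)

# Read the file as plain lines and flag any that are not valid UTF-8.
lines <- readLines(tmp, warn = FALSE)
bad <- which(!validUTF8(lines))

# Assumption: the offending bytes are GB2312. Convert every line to UTF-8,
# then parse the converted text rather than the original file.
fixed <- iconv(lines, from = "GB2312", to = "UTF-8")
dat <- read.csv(text = fixed, na.strings = "missing",
                stringsAsFactors = FALSE)
```

If the guess about the encoding is wrong, iconv() returns NA for the lines it cannot convert, so it is worth checking anyNA(fixed) before trusting the result.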

Randall C Johnson
Bioinformatics Analyst
SAIC-Frederick, Inc (Contractor)
Laboratory of Genomic Diversity
NCI-Frederick, P.O. Box B
Bldg 560, Rm 11-85
Frederick, MD 21702
Phone: (301) 846-1304
Fax: (301) 846-1686

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 20 Dec 2007 - 19:30:35 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 20 Dec 2007 - 20:30:20 GMT.
