[Rd] read.table / type.convert with NA values

From: Erik Iverson <eriki_at_ccbr.umn.edu>
Date: Tue, 29 Jun 2010 15:41:57 -0500


Hello,

While assisting a fellow R-helper off list, I narrowed down an issue he was having to the following behavior of type.convert, called through read.table. This is using R 2.10.1; apologies if newer versions don't exhibit this behavior.

# generates numeric vector
> type.convert(c("123.42", "NA"))

[1] 123.42 NA

# generates a numeric vector, notice the space before 123.42
> type.convert(c(" 123.42 ", "NA"))

[1] 123.42 NA

# generates factor, notice the space before NA
# note that the 2nd element is actually " NA", not a true NA value
> type.convert(c("123.42", " NA"))

[1] 123.42 NA
Levels: 123.42 NA
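
The third case can be brought in line with the first two by trimming white space before conversion. This is only a minimal sketch in base R: the helper name strip_ws is made up for illustration, and as.is = TRUE is passed explicitly so the result does not depend on a version-specific default.

```r
# Hypothetical helper (not part of R): drop leading/trailing white space
strip_ws <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

x <- c("123.42", " NA")

# The second element is the string " NA", which is not the string "NA"
x[2] == "NA"   # FALSE

# After trimming, it matches the default na.strings = "NA"
res <- type.convert(strip_ws(x), as.is = TRUE)
is.numeric(res)
is.na(res[2])
```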

How can this affect read.table/read.csv use 'in the wild'?

This gentleman had a data file that

  1. was delimited by something other than white space (CSV in his case),
  2. contained missing values, designated by NA in his case, and
  3. contained white space between delimiters and data values, e.g.,

NA, NA, 4.5, NA

as opposed to

NA,NA,4.5,NA

With these three conditions met, read.table gives type.convert a character vector like my third example above, and ultimately he got a data.frame consisting only of factors where we were expecting numeric columns. This is easily fixed either by specifying colClasses directly in the read.csv call or by setting strip.white = TRUE, which did the job just fine in his case.
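
The two fixes can be sketched as follows. The two-line input is made up for illustration; a second row is included so that no column is entirely NA, since an all-NA column would come back as logical rather than numeric. textConnection stands in for the actual file, and what the unmodified call returns may differ between R versions, so no output is shown for it.

```r
# Hypothetical data in the same layout: comma-separated, space after each comma
csv_lines <- c("NA, NA, 4.5, NA",
               "1.2, 3.4, NA, 5.6")

# As read with default settings; inspect the column classes on your version
raw <- read.csv(textConnection(csv_lines), header = FALSE)
str(raw)

# Fix 1: strip white space around fields before type conversion
fix_strip <- read.csv(textConnection(csv_lines), header = FALSE,
                      strip.white = TRUE)

# Fix 2: state the column classes up front instead of letting R guess
# (a single class is recycled across all columns)
fix_cols <- read.csv(textConnection(csv_lines), header = FALSE,
                     colClasses = "numeric")

sapply(fix_strip, class)
sapply(fix_cols, class)
```

Of the two, colClasses is the stricter choice, since it fails loudly if a column contains something that cannot be read as a number, while strip.white = TRUE still leaves the type guess to R.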

I believe the confusion stems from the fact that with no NA values in our data file, this would work as we would expect. The introduction of what we thought were NA values changed the behavior. In reality, these were not being treated as NA values by read.table/type.convert. The question is, should they be in this case?

This behavior of read.table/type.convert may very well be what is expected/needed. If so, this note could still be of use to someone in the future if they stumble upon similar behavior. The fact I wasn't able to uncover anyone who asked about it on list before probably means the situation is rare.

Best Regards,
Erik Iverson



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 29 Jun 2010 - 20:48:30 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 30 Jun 2010 - 01:51:28 GMT.
