Re: [Rd] read.table / type.convert with NA values

From: Peter Ehlers <ehlers_at_ucalgary.ca>
Date: Tue, 29 Jun 2010 19:45:35 -0600

Is there a compelling reason to have strip.white default to FALSE? It seems to me that it would be more common to want the TRUE case.
Having said that, I must confess that I've never had the problem Erik describes.

   -Peter Ehlers

On 2010-06-29 17:14, Matt Shotwell wrote:
> The document RFC 4180 (which appears to be the CSV standard used by R,
> see ?read.table) considers spaces to be part of the fielded value. Some
> have taken this to mean that all white space characters should be
> considered part of the fielded value, though the RFC is not explicit
> here. Hence, this behavior is in compliance with the "standard" for CSV
> files. It seems that R treats '\t' (and perhaps all?) separated value
> files the same way by default.
>
> The RFC is very short and easy to read if you're interested.
> http://tools.ietf.org/html/rfc4180
>
> -Matt
>
> On Tue, 2010-06-29 at 16:41 -0400, Erik Iverson wrote:
>> Hello,
>>
>> While assisting a fellow R-helper off list, I narrowed down an issue he
>> was having to the following behavior of type.convert, called through
>> read.table. This is using R 2.10.1; if newer versions don't exhibit
>> this behavior, apologies.
>>
>> # generates numeric vector
>> > type.convert(c("123.42", "NA"))
>> [1] 123.42 NA
>>
>> # generates a numeric vector, notice the spaces around 123.42
>> > type.convert(c(" 123.42 ", "NA"))
>> [1] 123.42 NA
>>
>> # generates factor, notice the space before NA
>> # note that the 2nd element is actually " NA", not a true NA value
>> > type.convert(c("123.42", " NA"))
>> [1] 123.42 NA
>> Levels: 123.42 NA
>>
>>
>> How can this affect read.table/read.csv use 'in the wild'?
>>
>> This gentleman had a data file that was
>>
>> 1) delimited by something other than white space, CSV in his case
>> 2) contained missing values, designated by NA in his case
>> 3) contained white space between delimiters and data values, e.g.,
>>
>> NA, NA, 4.5, NA
>>
>> as opposed to
>>
>> NA,NA,4.5,NA
>>
>>
>> With these 3 conditions met, read.table gives type.convert a character
>> vector like my third example above, and ultimately he got a data.frame
>> consisting of only factors when we were expecting numeric columns. This
>> was easily fixed either by modifying the read.csv function call to
>> specify colClasses directly, or in his case, strip.white = TRUE did the
>> job just fine.
>>
>> I believe the confusion stems from the fact that with no NA values in
>> our data file, this would work as we would expect. The introduction of
>> what we thought were NA values changed the behavior. In reality, these
>> were not being treated as NA values by read.table/type.convert. The
>> question is, should they be in this case?
>>
>> This behavior of read.table/type.convert may very well be what is
>> expected/needed. If so, this note could still be of use to someone in
>> the future if they stumble upon similar behavior. The fact I wasn't
>> able to uncover anyone who asked about it on list before probably means
>> the situation is rare.
>>
>> Best Regards,
>> Erik Iverson
>>

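For anyone who finds this thread later, the situation Erik describes can be reproduced with a minimal sketch like the one below. The data in csv_text are made up for illustration, and the text= argument to read.csv is assumed to be available (it is not in R 2.10.1; there, a textConnection or a real file would be needed). Note that from R 4.0.0 on, stringsAsFactors defaults to FALSE, so the affected columns should come back as character rather than factor.

```r
# Hypothetical three-column CSV; every field after the first has a leading space.
csv_text <- "a,b,c\nNA, NA, 4.5\n1.5, 2.5, NA\n"

# Default (strip.white = FALSE): " NA" is not the same string as "NA",
# so it is not matched against na.strings.
as_read <- read.csv(text = csv_text)
str(as_read)

# strip.white = TRUE trims leading and trailing white space from the
# fields before they are converted, so "NA" is recognised as missing.
fixed <- read.csv(text = csv_text, strip.white = TRUE)
str(fixed)
```

Comparing the two str() outputs shows which columns were affected by the leading space. Specifying colClasses directly, as Erik mentions, is the other route.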


R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 30 Jun 2010 - 01:49:07 GMT