Re: [Rd] Bug in read.table?

From: peter dalgaard <pdalgd_at_gmail.com>
Date: Tue, 16 Nov 2010 14:04:16 +0100

On Nov 16, 2010, at 02:59 , Ben Bolker wrote:

> Ben Bolker <bbolker <at> gmail.com> writes:
> 

>>
>> Ben Bolker <bbolker <at> gmail.com> writes:
>>
>>> 
>>> 

>>
>> Can simplify this still further:
>>
>> a b'c
>> d e'f
>> g h'i
> 
>  This example file leads to duplicate lines.
> Arguably it should have behavior analogous to:
> 

>> scan(what="")
> 1: a b'c
> 3: d e'f
> 5: g h'i
> 7: 
> Read 6 items
> [1] "a"   "b'c" "d"   "e'f" "g"   "h'i"
> 
> 

>>
>>> One of the first things that happens in read.table is that
>>> the first few lines are read with readTableHead:
>>> 
>>>  lines <- .Internal(readTableHead(file, nlines, comment.char, 
>>>       blank.lines.skip, quote, sep))
>>> 

>> In this case, this reads the first two lines as one line:
>> the single quote at position 4 of the first line is closed by
>> the one at position 4 of the second line, so the first newline
>> does not end a line.
>>
>> R then pushes back two copies of the lines that have
>> been read (this is normal behavior; I don't quite follow the
>> logic).
>>
>> The rest of the file is read with scan(), 1 line at a time.
>> However, there is the discrepancy between the way
>> that readTableHead interprets new lines in the middle of
>> quoted strings (it ignores them) and the way that scan()
>> interprets them (it takes them as the end of the quoted string).
> 
> 
>  Ping?
>  I think this counts as a small, but real, bug. Should I go ahead
> and report it as such, or would someone explain why it's not a bug?
> 

I think filing it as a bug is defensible, but it is tricky to pinpoint exactly what the issue is. For example, notice that adding a few spaces changes the behaviour of scan() considerably:

> scan(what="")

1:  a b 'c
1: d e' f
5: g h' i
8: 
Read 7 items
[1] "a"      "b"      "c\nd e" "f"      "g"      "h'"     "i"     

(I'm confused... What is it that we really want here?)

Also, as you noted originally, beware the "Well don't do that then" aspect...
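
For what it's worth, here is a minimal sketch of the "don't do that" route, assuming the apostrophes in the file are meant as ordinary data and not as quote characters. The temporary file and the variable names (path, tab, items) are made up for illustration; quote = "" is the documented way to switch off quote processing in both read.table() and scan(). With quoting off, neither function should treat an apostrophe as opening a string that spans a newline, so the three-line example ought to come back as three rows of two fields.

```r
# Hypothetical reproduction file; the three lines are the example from this thread.
path <- tempfile(fileext = ".txt")
writeLines(c("a b'c", "d e'f", "g h'i"), path)

# With quoting disabled, the apostrophes are ordinary characters, so each
# physical line is one record with two whitespace-separated fields.
tab <- read.table(path, quote = "", stringsAsFactors = FALSE)
print(tab)

# scan() with quoting disabled splits the same file into plain tokens.
items <- scan(path, what = "", quote = "", quiet = TRUE)
print(items)

unlink(path)
```

This only sidesteps the discrepancy between readTableHead and scan(); it says nothing about what the behaviour should be when quote processing is left on.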

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes_at_cbs.dk  Priv: PDalgd_at_gmail.com

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 16 Nov 2010 - 13:08:24 GMT