Re: [Rd] Bug in read.table?

From: Ben Bolker <bbolker_at_gmail.com>
Date: Sun, 07 Nov 2010 23:38:18 +0000 (UTC)

Ben Bolker <bbolker <at> gmail.com> writes:

>
> <jgarcia <at> ija.csic.es> writes:
>
> >
> > Thanks. Yes, quote="" solves the problem.
> >
> > I would never say, however, from the documentation, that this was causing
> > the duplicate records. Rather, I would have expected some kind of
> > warning/error message.
> >
> > And, yes, I knew that, through duplicate(), R solves gracefully this
> > specific problem. Just thought this could be of interest for R devel.
> >
>
> A bit of a meta- point here: there may indeed be a bug here
> (it's the kind of obscure "corner case" that someone may not have
> tested), but it's unlikely to get noted as such and fixed unless you
> can come up with a clear analysis of what is happening and how the
> misinterpretation of quote characters is leading to duplication of
> records. (You, or someone else -- recognizing that this may be beyond
> your skill level. It might be that 'just' very careful thought
> and analysis of the behavior described in the documentation would
> explain this, or one might have to dig through source code in R or C.)
> Problems with unescaped/unrecognized quote characters are very
> common.
>
> Otherwise, this will likely be dismissed as a ("doctor, it hurts
> when I do this"; "well then, don't do that!") sort of situation.
>
> Ben Bolker

  Following up on my own point:

    The bottom line is that the internal readTableHead() function handles newlines within quoted strings differently from scan().

  Explanation:

A simpler file that replicates the problem is:

a b'c"d"e
f g'h"i"j
k l'm"n"o 

(didn't want to try reading this from a textConnection -- escaping all the quotes properly would have driven me nuts).
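For anyone who wants to reproduce this without editing a file by hand, here is a minimal sketch that writes the example to disk. The three lines and the file name tmp3.txt come from this message; placing the file in R's temporary directory and the variable names are assumptions made for illustration. Inside a double-quoted R string only the double quotes need escaping.

```r
# Write the three-line example to tmp3.txt in R's temporary directory
# (the location is an assumption; any writable path would do).
path <- file.path(tempdir(), "tmp3.txt")
example_lines <- c("a b'c\"d\"e",
                   "f g'h\"i\"j",
                   "k l'm\"n\"o")
writeLines(example_lines, path)

# readLines() does no quote processing, so it shows the raw file contents.
raw <- readLines(path)
cat(raw, sep = "\n")
```

readLines() is used for the check because it splits purely on newlines, so it is unaffected by the quoting issue discussed here.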

 One of the first things that happens in read.table is that the first few lines are read with readTableHead:

  lines <- .Internal(readTableHead(file, nlines, comment.char,
                                   blank.lines.skip, quote, sep))

  In this case, it reads the first two lines as one line: the single quote at position 4 of the first line is closed only by the single quote at position 4 of the second line, so the first newline does not end the line.
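The quote matching described above can be modelled in a few lines of plain R. This is only a simplified sketch of the behaviour as described in this message, not the actual C code behind readTableHead: the helper open_quote_at_end() is hypothetical, and it assumes that a quote character may open a string anywhere in the line and that only the same character closes it.

```r
# Hypothetical helper (not part of R): return the quote character that is
# still open at the end of `text`, or "" if no quoted string is open.
open_quote_at_end <- function(text, quote_chars = c("\"", "'")) {
  open <- ""                              # "" means we are outside any quote
  for (ch in strsplit(text, "")[[1]]) {
    if (nzchar(open)) {
      if (ch == open) open <- ""          # the matching character closes it
    } else if (ch %in% quote_chars) {
      open <- ch                          # assumption: a quote may open anywhere
    }
  }
  open
}

line1 <- "a b'c\"d\"e"
line2 <- "f g'h\"i\"j"

after_line1 <- open_quote_at_end(line1)
after_both  <- open_quote_at_end(paste(line1, line2, sep = "\n"))
print(after_line1)
print(after_both)
```

Under these assumptions the single quote is still open at the end of the first line (the double quotes inside it are treated as ordinary characters), and everything is balanced again only once the second line has been appended.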

  R then pushes back two copies of the lines that have been read (this is normal behavior; I don't quite follow the logic).

  The rest of the file is read with scan(), one line at a time. However, there is a discrepancy between the way that readTableHead interprets newlines in the middle of quoted strings (it does not treat them as line endings) and the way that scan() interprets them (it takes them as the end of the quoted string).

In particular, if the file "tmp3.txt" is as shown above, then the command

.Internal(readTableHead(file("tmp3.txt"),nlines=1L,"#",FALSE,quote="\"'",sep=""))

produces

[1] "a b'c\"d\"e\nf g'h\"i\"j"  

(i.e. it grabs the first two lines, including the \n)

and

scan(file("tmp3.txt"),nlines=1L,quote="\"'",what="")

produces

Read 2 items
[1] "a" "b'c\"d\"e"

(it terminates the line in the middle of the string opened by the single quote).

 I don't know what the consequences would be of changing readTableHead to match scan()'s behavior, or how much trouble it would be to do so.
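In the meantime, the workaround mentioned earlier in the thread (quote="") can be checked on the small example. This is a minimal sketch: the temporary-directory path and the variable names are assumptions for illustration, and the file is rewritten here so that the snippet runs on its own.

```r
# Recreate the example file (location in tempdir() is an assumption).
path <- file.path(tempdir(), "tmp3.txt")
writeLines(c("a b'c\"d\"e", "f g'h\"i\"j", "k l'm\"n\"o"), path)

# With quote = "" neither ' nor " is special, so every newline ends a record
# and both the header-reading step and scan() see the same three lines.
fields <- count.fields(path, quote = "")
dat <- read.table(path, quote = "", stringsAsFactors = FALSE)

print(fields)
print(dat)
```

With quoting disabled the quote characters simply stay in the data as literal characters, which is fine for a file like this one, where they were never meant as delimiters.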



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sun 07 Nov 2010 - 23:43:59 GMT
