Re: [Rd] read.csv trap

From: Ben Bolker <bbolker_at_gmail.com>
Date: Fri, 11 Feb 2011 17:00:30 -0500

On 02/11/2011 03:37 PM, Laurent Gatto wrote:

> On 11 February 2011 19:39, Ben Bolker <bbolker_at_gmail.com> wrote:

>>
> [snip]
>>
>> What is dangerous/confusing is that R silently **wraps** longer lines if
>> fill=TRUE (which is the default for read.csv). I encountered this when
>> working with a colleague on a long, messy CSV file that had some phantom
>> extra fields in some rows, which then turned into empty lines in the
>> data frame.
>>
> 
> As a matter of fact, this is exactly what happened to a colleague of
> mine yesterday and caused her quite a bit of trouble. On the other
> hand, it could also be considered a 'bug' in the CSV file. Although
> no formal specification exists for the CSV format, RFC 4180 [1]
> indicates that 'each line should contain the same number of fields
> throughout the file'.
> 
> [1] http://tools.ietf.org/html/rfc4180
> 
> Best wishes,
> 
> Laurent

  Asserting that the bug is in the CSV file is logically consistent, but if this is true then the "fill=TRUE" argument (which is only needed when the lines contain different numbers of fields) should not be allowed.
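
  If one does take the RFC 4180 view that ragged rows are an error in the file, a check along these lines could flag them before read.csv gets a chance to wrap anything. This is only a sketch: ragged_rows is a made-up helper name (not part of base R), and it assumes that the first line is the header and that its field count is the expected width.

```r
## hypothetical helper, not part of base R:
## returns the positions of lines whose field count differs from the header's
ragged_rows <- function(fn, sep=",", quote="\"", comment.char="") {
  ## count.fields skips blank lines by default, so positions refer
  ## to non-blank lines; NA marks lines inside multi-line quoted fields
  nf <- count.fields(fn, sep=sep, quote=quote, comment.char=comment.char)
  expected <- nf[1]   # assumption: line 1 is the header
  which(is.na(nf) | nf != expected)
}

tmp <- tempfile(fileext=".csv")
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "6,g,h,i,j,k,l,m,n"),
           con=tmp)
bad <- ragged_rows(tmp)
if (length(bad) > 0) {
  message("lines with unexpected field counts: ",
          paste(bad, collapse=", "))
}
```

  The quote and comment.char defaults here are chosen to match read.csv rather than count.fields, whose own defaults differ.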

  I had never seen RFC 4180 before -- interesting! I note especially points 5-7, which define the handling of double quotation marks (but say nothing about single quotes or about using backslashes as escape characters).
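
  For what it's worth, here is a small illustration of how read.csv treats the double-quote conventions described in those points. The data are invented for the example, and it relies on the read.csv default of quote="\"" (so an apostrophe is an ordinary character) and on the text= argument of current versions of R.

```r
## invented example data: a quoted comma, a doubled double quote,
## and an unquoted field containing an apostrophe
csv <- c('id,comment',
         '1,"plain, with a comma"',
         '2,"she said ""hi"""',
         "3,it's fine")

d <- read.csv(text=csv, stringsAsFactors=FALSE)

## the comma stays inside its field, the doubled quote collapses
## to a single one, and the apostrophe is read literally
print(d$comment)
```

  Note that read.table itself defaults to quote="\"'", so the same lines read with read.table(sep=",") may be parsed differently because of the apostrophe.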

  Dealing with read.[table|csv] seems a bit of an Augean task <http://en.wikipedia.org/wiki/Augeas> (hmmm, maybe I should write a parallel document to Burns's _Inferno_ ...)

  cheers
    Ben

> 

>> Here is an example and a workaround that runs count.fields on the
>> whole file to find the maximum column length and set col.names
>> accordingly. (It assumes you don't already have a file named "test.csv"
>> in your working directory ...)
>>
>> I haven't dug in to try to write a patch for this -- I wanted to test
>> the waters and see what people thought first, and I realize that
>> read.table() is a very complicated piece of code that embodies a lot of
>> tradeoffs, so there could be lots of different approaches to trying to
>> mitigate this problem. I appreciate very much how hard it is to write a
>> robust and general function to read data files, but I also think it's
>> really important to minimize the number of traps in read.table(), which
>> will often be the first part of R that new users encounter ...
>>
>> A quick fix for this might be to allow the number of lines analyzed
>> for length to be settable by the user, or to allow a settable 'maxcols'
>> parameter, although those would only help in the case where the user
>> already knows there is a problem.
>>
>> cheers
>> Ben Bolker
>>
>> ===============
>> ## write a small CSV whose last row has more fields than the header
>> writeLines(c("A,B,C,D",
>>              "1,a,b,c",
>>              "2,f,g,c",
>>              "3,a,i,j",
>>              "4,a,b,c",
>>              "5,d,e,f",
>>              "6,g,h,i,j,k,l,m,n"),
>>            con="test.csv")
>>
>> read.csv("test.csv")                  ## silently wraps the long row
>> try(read.csv("test.csv",fill=FALSE))  ## stops with an error instead
>>
>> ## assumes header=TRUE, fill=TRUE; should be a little more careful
>> ## with comment, quote arguments (possibly explicit)
>> ## ... contains information about quote, comment.char;
>> ## sep is a named argument, so it must be passed on explicitly
>> Read.csv <- function(fn,sep=",",...) {
>>   ## column names from the header line
>>   colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>>   ncolnames <- length(colnames)
>>   ## widest row anywhere in the file
>>   maxcols <- max(count.fields(fn,sep=sep,...))
>>   if (maxcols>ncolnames) {
>>     colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>>   }
>>   ## assumes you don't have any other columns labeled "V[large number]"
>>   read.csv(fn,sep=sep,...,col.names=colnames)
>> }
>>
>> Read.csv("test.csv")
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>




Received on Fri 11 Feb 2011 - 22:03:49 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Mar 2011 - 19:10:26 GMT.
