Re: [Rd] read.csv trap

From: Laurent Gatto <laurent.gatto_at_gmail.com>
Date: Fri, 11 Feb 2011 20:37:46 +0000

On 11 February 2011 19:39, Ben Bolker <bbolker_at_gmail.com> wrote:
[snip]
>
> What is dangerous/confusing is that R silently **wraps** longer lines if
> fill=TRUE (which is the default for read.csv).  I encountered this when
> working with a colleague on a long, messy CSV file that had some phantom
> extra fields in some rows, which then turned into empty lines in the
> data frame.
>

As a matter of fact, this is exactly what happened to a colleague of mine yesterday, and it caused her quite a bit of trouble. On the other hand, it could also be considered a 'bug' in the csv file. Although no formal specification exists for the csv format, RFC 4180 [1] indicates that 'each line should contain the same number of fields throughout the file'.

[1] http://tools.ietf.org/html/rfc4180
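
For what it's worth, the RFC 4180 expectation can be checked before reading the file. The following is only a rough sketch: check_field_counts() is a name made up for this illustration (it is not part of R), and it assumes a comma-separated file with a header line and double-quote quoting.

```r
# Hypothetical helper (not part of base R): list the lines whose number of
# fields differs from that of the first (header) line.
check_field_counts <- function(fn, sep = ",", quote = "\"") {
  # blank.lines.skip = FALSE is used so that positions in 'n' are meant to
  # line up with lines in the file
  n <- count.fields(fn, sep = sep, quote = quote, comment.char = "",
                    blank.lines.skip = FALSE)
  bad <- which(n != n[1])  # which() silently drops NA counts
  data.frame(line = bad, fields = n[bad], expected = rep(n[1], length(bad)))
}

# Small throwaway file: the last line has one field too many
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D", "1,a,b,c", "2,f,g,c", "3,a,i,j,k"), tmp)
bad_lines <- check_field_counts(tmp)
print(bad_lines)
```

An empty data frame would mean every line has as many fields as the header; anything else points at the lines worth inspecting by hand.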

Best wishes,

Laurent

>  Here is an example and a workaround that runs count.fields on the
> whole file to find the maximum column length and set col.names
> accordingly.  (It assumes you don't already have a file named "test.csv"
> in your working directory ...)
>
>  I haven't dug in to try to write a patch for this -- I wanted to test
> the waters and see what people thought first, and I realize that
> read.table() is a very complicated piece of code that embodies a lot of
> tradeoffs, so there could be lots of different approaches to trying to
> mitigate this problem. I appreciate very much how hard it is to write a
> robust and general function to read data files, but I also think it's
> really important to minimize the number of traps in read.table(), which
> will often be the first part of R that new users encounter ...
>
>  A quick fix for this might be to allow the number of lines analyzed
> for length to be settable by the user, or to allow a settable 'maxcols'
> parameter, although those would only help in the case where the user
> already knows there is a problem.
>
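
To illustrate the "number of lines analyzed" point: ?read.table says the number of columns is determined from the first five lines of input (or from col.names if that is longer). The sketch below uses two made-up files without a header, identical except for where the over-long row sits. As far as I can tell, the first should come back with 5 columns and the second with 3 columns plus a wrapped extra row.

```r
# Two throwaway files that differ only in the position of the long row
early <- tempfile(fileext = ".csv")
late  <- tempfile(fileext = ".csv")
short <- paste0(1:6, ",a,b")                              # six 3-field rows
writeLines(append(short, "0,x,y,z,w", after = 1), early)  # long row is line 2
writeLines(c(short, "7,x,y,z,w"), late)                   # long row is line 7

# header = FALSE so that only the data rows decide the column count
d_early <- read.csv(early, header = FALSE)
d_late  <- read.csv(late,  header = FALSE)

dim(d_early)  # long row seen within the first five lines
dim(d_late)   # long row seen too late, so it is wrapped
```

With header = TRUE the early case behaves differently (read.table complains about more columns than column names), which is arguably the more helpful outcome.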
>  cheers
>    Ben Bolker
>
> ===============
> writeLines(c("A,B,C,D",
>             "1,a,b,c",
>             "2,f,g,c",
>             "3,a,i,j",
>             "4,a,b,c",
>             "5,d,e,f",
>             "6,g,h,i,j,k,l,m,n"),
>           con="test.csv")
>
>
> read.csv("test.csv")
> try(read.csv("test.csv",fill=FALSE))
>
> ## assumes header=TRUE, fill=TRUE; should be a little more careful
> ##  with comment, quote arguments (possibly explicit)
> ## ... contains information about quote, comment.char, sep
> Read.csv <- function(fn,sep=",",...) {
>  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
>  ncolnames <- length(colnames)
>  maxcols <- max(count.fields(fn,sep=sep,...))
>  if (maxcols>ncolnames) {
>    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
>  }
>  ## assumes you don't have any other columns labeled "V[large number]"
>  read.csv(fn,sep=sep,...,col.names=colnames)
> }
>
> Read.csv("test.csv")
>
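
Following up on the "should be a little more careful with comment, quote arguments" comment in the code above, here is one possible variant that passes quote and comment.char explicitly to every call, so that scan(), count.fields() and read.csv() all parse the file the same way. It is only a sketch: the name read_csv_wide() is invented here, the defaults are simply read.csv's own, and it still assumes header=TRUE and no existing columns named "V[number]".

```r
# Hypothetical variant of the workaround with explicit quote/comment.char
read_csv_wide <- function(fn, sep = ",", quote = "\"", comment.char = "", ...) {
  # column names from the header line
  header <- scan(fn, what = "character", nlines = 1, sep = sep, quote = quote,
                 comment.char = comment.char, quiet = TRUE)
  # widest line anywhere in the file (NA counts are ignored)
  n <- count.fields(fn, sep = sep, quote = quote, comment.char = comment.char)
  maxcols <- max(n, na.rm = TRUE)
  # pad the names with V5, V6, ... when some row is wider than the header
  if (maxcols > length(header)) {
    header <- c(header, paste0("V", seq(length(header) + 1, maxcols)))
  }
  read.csv(fn, sep = sep, quote = quote, comment.char = comment.char,
           col.names = header, ...)
}

# Same data as the example above, written to a temporary file instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D", "1,a,b,c", "2,f,g,c", "3,a,i,j",
             "4,a,b,c", "5,d,e,f", "6,g,h,i,j,k,l,m,n"), tmp)
wide <- read_csv_wide(tmp)
dim(wide)
names(wide)
```

The long row then stays on one line of the data frame, and the shorter rows are padded in the extra columns.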
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
[ Laurent Gatto | slashhome.be ]

Received on Fri 11 Feb 2011 - 20:43:50 GMT