Re: [Rd] read.csv trap

From: Ben Bolker <bbolker_at_gmail.com>
Date: Thu, 03 Mar 2011 16:17:56 +0000

Ben Bolker <bbolker <at> gmail.com> writes:

> On 02/11/2011 03:37 PM, Laurent Gatto wrote:
> > On 11 February 2011 19:39, Ben Bolker <bbolker <at> gmail.com> wrote:
> >>
> > [snip]
> >>

  Bump. Is there any opinion about this from R-core?? Will I be scolded if I submit this as a bug ... ??

> >> What is dangerous/confusing is that R silently **wraps** longer lines if
> >> fill=TRUE (which is the default for read.csv). I encountered this when
> >> working with a colleague on a long, messy CSV file that had some phantom
> >> extra fields in some rows, which then turned into empty lines in the
> >> data frame.
> >>
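One way to spot the problem before reading the file is to compare each line's field count with the header's. This is only a minimal sketch: the temporary file and the names tmp, nf and bad are illustrative and do not come from the original post, and quote/comment.char are set explicitly on the assumption that they should match read.csv's defaults.

```r
## Self-contained sketch: compare each line's field count with the header's.
## The temporary file and the names 'tmp', 'nf', 'bad' are illustrative only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D",
             "1,a,b,c",
             "2,f,g,c",
             "6,g,h,i,j,k,l,m,n"),
           con = tmp)

## quote and comment.char are chosen to match read.csv's defaults
nf <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
table(nf)                  # distribution of fields per line
bad <- which(nf > nf[1])   # lines with more fields than the header line
bad

unlink(tmp)
```

Any line reported in bad is one that read.csv would wrap onto extra rows if it falls after the first five lines of the file.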

  [snip snip]

> >> Here is an example and a workaround that runs count.fields on the
> >> whole file to find the maximum column length and set col.names
> >> accordingly. (It assumes you don't already have a file named "test.csv"
> >> in your working directory ...)
> >>
> >> I haven't dug in to try to write a patch for this -- I wanted to test
> >> the waters and see what people thought first, and I realize that
> >> read.table() is a very complicated piece of code that embodies a lot of
> >> tradeoffs, so there could be lots of different approaches to trying to
> >> mitigate this problem. I appreciate very much how hard it is to write a
> >> robust and general function to read data files, but I also think it's
> >> really important to minimize the number of traps in read.table(), which
> >> will often be the first part of R that new users encounter ...
> >>
> >> A quick fix for this might be to allow the number of lines analyzed
> >> for length to be settable by the user, or to allow a settable 'maxcols'
> >> parameter, although those would only help in the case where the user
> >> already knows there is a problem.
> >>
> >> cheers
> >> Ben Bolker
> >>



writeLines(c("A,B,C,D",
            "1,a,b,c",
            "2,f,g,c",
            "3,a,i,j",
            "4,a,b,c",
            "5,d,e,f",
            "6,g,h,i,j,k,l,m,n"),
          con="test.csv")

> >>
> >>

read.csv("test.csv")
try(read.csv("test.csv",fill=FALSE))
> >>
## assumes header=TRUE, fill=TRUE; should be a little more careful
##  with comment, quote arguments (possibly explicit)
## ... contains information about quote, comment.char, sep
Read.csv <- function(fn,sep=",",...) {
  ## read the header line to get the declared column names
  colnames <- scan(fn,nlines=1,what="character",sep=sep,...)
  ncolnames <- length(colnames)
  ## widest line anywhere in the file, not just in the first five lines
  maxcols <- max(count.fields(fn,sep=sep,...))
  if (maxcols>ncolnames) {
    colnames <- c(colnames,paste("V",(ncolnames+1):maxcols,sep=""))
  }
  ## assumes you don't have any other columns labeled "V[large number]"
  read.csv(fn,sep=sep,...,col.names=colnames)
}

Read.csv("test.csv")
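
A stricter alternative to padding the column names would be to refuse to read such a file at all. The sketch below is hypothetical: read_csv_strict is not part of base R or of the original post, and it assumes comment.char="" (read.csv's default) when counting fields, so it would need adjusting if a different comment.char were passed through "...".

```r
## Hypothetical helper (not part of base R): stop with an error if any line
## has more fields than the header, instead of letting read.csv wrap it.
read_csv_strict <- function(fn, sep = ",", quote = "\"", ...) {
  ## assumption: comment.char = "" to match read.csv's default
  nf <- count.fields(fn, sep = sep, quote = quote, comment.char = "")
  extra <- which(nf > nf[1])
  if (length(extra) > 0) {
    stop("line(s) ", paste(extra, collapse = ", "),
         " have more fields than the header (", nf[1], ")")
  }
  read.csv(fn, sep = sep, quote = quote, ...)
}

## Illustrative check on a throwaway file with one over-long line
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,B,C,D", "1,a,b,c", "6,g,h,i,j,k,l,m,n"), con = tmp)
res <- try(read_csv_strict(tmp), silent = TRUE)
inherits(res, "try-error")   # the over-long line triggers the error
unlink(tmp)
```

This trades convenience for safety: the user has to fix or explicitly handle the offending lines, but nothing is silently reshaped.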



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 03 Mar 2011 - 16:24:41 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 04 Mar 2011 - 10:50:25 GMT.
