Re: [Rd] read.csv

From: Petr Savicky <savicky_at_cs.cas.cz>
Date: Tue, 16 Jun 2009 20:09:01 +0200

On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
> On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
> > If read.csv's colClasses= argument is NOT used then read.csv accepts
> > double quoted numerics:
> >
> >> read.csv(stdin())
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> >   A B
> > 1 1 1
> > 2 2 2
> >
> > However, if colClasses is used then it seems that it does not:
> >
> >> read.csv(stdin(), colClasses = "numeric")
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> > na.strings, :
> > scan() expected 'a real', got '"1"'
> >
> > Is this really intended? I would have expected that a csv file
> > in which each field is surrounded with double quotes is acceptable
> > in both cases. This may be documented as is yet seems undesirable
> > from both a consistency viewpoint and the viewpoint that it should
> > be possible to double quote fields in a csv file.
>
> Well, the default for colClasses is NA, for which ?read.csv says:
> [...]
> Possible values are 'NA' (when 'type.convert' is used),
> [...]
> and then ?type.convert says:
> This is principally a helper function for 'read.table'. Given a
> character vector, it attempts to convert it to logical, integer,
> numeric or complex, and failing that converts it to factor unless
> 'as.is = TRUE'. The first type that can accept all the non-missing
> values is chosen.
>
> It would seem that type 'logical' won't accept integer (naively one
> might expect 1 --> TRUE, but see experiment below), so the first
> acceptable type for "1" is integer, and that is what happens.
> So it is indeed documented (in the R[ecursive] sense of "documented" :))
>
> However, presumably when colClasses is used then type.convert() is
> not called, in which case R sees itself being asked to assign a
> character entity to a destination which it has been told shall be
> integer, and therefore, since the default for as.is is
> as.is = !stringsAsFactors
> but for this ?read.csv says that stringsAsFactors "is overridden
> bu [sic] 'as.is' and 'colClasses', both of which allow finer
> control.", so that wouldn't come to the rescue either.
>
> Experiment:
> X <-logical(10)
> class(X)
> # [1] "logical"
> X[1]<-1
> X
> # [1] 1 0 0 0 0 0 0 0 0 0
> class(X)
> # [1] "numeric"
> so R has converted X from class 'logical' to class 'numeric'
> on being asked to assign a number to a logical; but in this
> case its hands were not tied by colClasses.
>
> Or am I missing something?!!

In my opinion, you explain how the difference in behavior between
  read.csv(stdin(), colClasses = "numeric") and
  read.csv(stdin())
arises, but not why it should be so.
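
For readers who want to reproduce the difference without typing at the stdin() prompt, here is a minimal sketch. It assumes an R version in which read.csv() accepts a text= argument (the thread itself used stdin()), and the names csv_text, d1 and res are only illustrative. The comments describe what ?read.table documents; they are not a statement about the internals.

```r
# Same two-row input as in the thread, held in a string instead of typed at stdin()
csv_text <- 'A,B\n"1",1\n"2",2'

# Without colClasses: columns are first read as character, where quoting is
# honoured, and are then passed through type.convert()
d1 <- read.csv(text = csv_text)
str(d1)

# With colClasses = "numeric": the quoted field is handed to scan() as a
# number, so the call is expected to fail; the error message is captured
res <- tryCatch(
  read.csv(text = csv_text, colClasses = "numeric"),
  error = function(e) conditionMessage(e)
)
print(res)
```

If the second call fails as reported above, res holds the scan() error message rather than a data frame.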

The algorithm "use the smallest type which accepts all non-missing values" may well be applied to the input values either literally or after removing the quotes. Is there a reason why
  read.csv(stdin())
removes quotes from the input values and
  read.csv(stdin(), colClasses = "numeric")
does not?
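
The two readings of the algorithm can be compared directly with type.convert(). This is only a sketch: the vectors quoted and unquoted are made up for illustration, and gsub() stands in for whatever quote removal the reader would perform.

```r
# Field text taken literally, quote characters included
quoted <- c('"1"', '"2"')

# The same fields after removing the double quotes
unquoted <- gsub('"', "", quoted, fixed = TRUE)

# Applied literally, no numeric type accepts the values, so they stay character
class(type.convert(quoted, as.is = TRUE))

# Applied after quote removal, the first type accepting all values is integer
class(type.convert(unquoted, as.is = TRUE))
```

So the choice of whether quotes are stripped before or after the type is decided changes the result, which is the point of the question.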

Using double-quote characters is part of the definition of a CSV file; see, for example,
  http://en.wikipedia.org/wiki/Comma_separated_values
where one may find
  Fields may always be enclosed within double-quote characters, whether necessary or not.
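
Until the behavior is settled, a file in which every field is quoted can still be read with explicit column types by a detour through character columns. This is a possible workaround sketch, not something proposed in the thread; it again assumes the text= argument is available, and csv_text and d are illustrative names.

```r
# A CSV in which every field, including the header, is double quoted
csv_text <- '"A","B"\n"1","1"\n"2","2"'

# Read all columns as character, so that the quotes are stripped on input
d <- read.csv(text = csv_text, colClasses = "character")

# Convert each column to numeric afterwards; d[] keeps the data frame structure
d[] <- lapply(d, as.numeric)
str(d)
```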

Petr.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 16 Jun 2009 - 18:27:37 GMT
