[Rd] read.csv reads more rows than indicated by wc -l

From: G See <gsee000_at_gmail.com>
Date: Wed, 19 Dec 2012 07:37:53 -0600


When I have a csv file that is more than 6 lines long (not including the header), one of the fields is blank for the last few lines, and there is an extra comma on one of the lines with the blank field, read.csv() creates an extra row.

I attached an example file; I'll also paste the contents here:

A,apple
A,orange
A,orange
A,orange
A,orange
A,,,
A,,



wc -l reports that this file has 7 lines

R> system("wc -l test.csv")
7 test.csv

But read.csv reads 8 rows.

R> read.csv("test.csv", header=FALSE, stringsAsFactors=FALSE)
  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8  A
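
For reference, here is a minimal sketch of how the mismatch can be inspected from within R, using count.fields() from utils to show how many comma-separated fields each line carries. It recreates the seven-line example in a temporary file; the names tmp, n_lines, n_fields and dat are only illustrative and do not appear in the original report.

```r
# Recreate the example file in a temporary location (tmp is a hypothetical name).
tmp <- tempfile(fileext = ".csv")
writeLines(c("A,apple", "A,orange", "A,orange", "A,orange", "A,orange",
             "A,,,", "A,,"), tmp)

# Physical lines in the file, comparable to what `wc -l` reports.
n_lines <- length(readLines(tmp))

# Number of comma-separated fields found on each line.
n_fields <- count.fields(tmp, sep = ",")

# Rows and columns produced by read.csv() with the same arguments as above.
dat <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)

print(n_lines)
print(n_fields)
print(dim(dat))
```

Comparing n_fields with ncol(dat) shows which lines are wider than the column count that read.csv() settled on.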

If I increase the number of commas at the end of the line, it increases the number of rows.

This R command to read a 7-line csv:

read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,orange
A,,,,,
A,,")

will produce this:

  V1     V2
1  A  apple
2  A orange
3  A orange
4  A orange
5  A orange
6  A
7
8
9  A

But if the file has fewer than 7 lines, it doesn't increase the number of rows.

This R command to read a 6-line csv:

read.csv(header=FALSE, text="A,apple
A,orange
A,orange
A,orange
A,,,,,
A,,")

will produce this:

  V1     V2 V3 V4 V5 V6
1  A  apple NA NA NA NA
2  A orange NA NA NA NA
3  A orange NA NA NA NA
4  A orange NA NA NA NA
5  A        NA NA NA NA
6  A        NA NA NA NA



Is this intended behavior?
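
The difference between the 6-line and 7-line cases looks consistent with the ?read.table help page, which states that the number of data columns is determined by looking at the first five lines of input, or from the length of col.names if that is specified and is longer. Under that reading, one possible workaround (a sketch, not a confirmed recommendation from the list) is to size col.names from the widest line in the whole file. The names csv_lines, tmp, n_cols and dat are hypothetical.

```r
# The 7-line example, written to a temporary file (names are illustrative).
csv_lines <- c("A,apple", "A,orange", "A,orange", "A,orange", "A,orange",
               "A,,,,,", "A,,")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Widest line in the file, counted over all lines rather than the first five.
n_cols <- max(count.fields(tmp, sep = ","))

# Supplying col.names of that length fixes the column count before reading,
# so wide lines are no longer wrapped onto additional rows.
dat <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE,
                col.names = paste0("V", seq_len(n_cols)))

print(dim(dat))
```

With the column count fixed this way, the number of rows should match the number of lines in the file, as it already does in the 6-line example.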

Thanks,
Garrett See

R> version

               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          2
minor          15.2
year           2012
month          10
day            26
svn rev        61015
language       R
version.string R version 2.15.2 (2012-10-26)
nickname       Trick or Treat

R> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 19 Dec 2012 - 13:42:56 GMT
