Re: [R] Reading a csv file row by row

From: Henrik Bengtsson <hb_at_stat.berkeley.edu>
Date: Fri 06 Apr 2007 - 15:10:43 GMT

Hi.

On 4/6/07, Yuchen Luo <realityrandom@gmail.com> wrote:
> Hi, my friends.
> When a data file is large, loading the whole file into the memory all
> together is not feasible. A feasible way is to read one row, process it,
> store the result, and read the next row.
>
>
> In Fortran, by default, the 'read' command reads one line of a file, which
> is convenient, and when the same 'read' command is executed the next time,
> the next row of the same file will be read.
>
> I tried to replicate such row-by-row reading in R. I use scan( ) to do so
> with the "skip= xxx " option. It takes only seconds when the number of the
> rows is within 1000. However, it takes hours to read 10000 rows. I think it
> is because every time R reads, it needs to start from the first row of the
> file and count xxx rows to find the row it needs to read. Therefore, it
> takes more time for R to locate the row it needs to read.

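Yes, to skip rows scan() still has to read through the file and locate every single line ending (line feed/carriage return) before the row you ask for, so the cost of each call grows with the value of skip. The only gain you get is that it does not have to parse and store the contents of those skipped lines.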
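
If the rows only need to be processed once and in order, an opened connection avoids the repeated skipping altogether, because each read continues from where the previous one stopped. A minimal sketch, assuming a comma-separated file of numbers with no header and no blank lines; the example file and the helper name processRows() are made up for illustration:

```r
# Made-up example data: 5 rows, 3 numeric columns, no header
pathname <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%d,%d", 1:5, 11:15, 21:25), pathname)

# Hypothetical helper: read one row at a time from an open connection
# and apply FUN to it. No skip= is needed, since the connection keeps
# its position between calls to scan().
processRows <- function(pathname, FUN) {
  con <- file(pathname, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    row <- scan(con, what = numeric(), sep = ",", nlines = 1, quiet = TRUE)
    # A zero-length result is taken as end of file here
    # (this sketch assumes the file has no blank lines).
    if (length(row) == 0) break
    results[[length(results) + 1]] <- FUN(row)
  }
  results
}

# Process every row in turn, keeping only the per-row result
totals <- processRows(pathname, sum)
```

This only covers strictly sequential access; jumping to an arbitrary row still needs the file positions discussed next.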

One solution is to first go through the file and register the file position of the first character in every line, and then make use of this in subsequent reads. In order to do this, you have to work with an opened connection and pass that to scan instead. Rough sketch:

con <- file(pathname, open="r")

# Scan file for first position of every line
rowStarts <- scanForRowStarts(con)

# Skip to a certain row and read a set of lines
# (row = index of the first row to read):
seek(con, where=rowStarts[row], origin="start", rw="r")
data <- scan(con, ..., skip=0, nlines=rowsPerChunk)

close(con)

That's the idea. The tricky part is to get scanForRowStarts() correct. After reading a line you can always query the connection for the current file position using:

  pos <- seek(con, rw="r")

so you could always iterate between readLines(con, n=1) and pos <- c(pos, seek(con, rw="r")), but there might be a faster way.
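
The readLines()/seek() iteration just described could be sketched roughly as follows. This is only one possible way to write scanForRowStarts(), not a tested or tuned implementation; the example file is made up, and the offsets it finds depend on the platform's line endings (the R documentation for seek() also cautions against relying on it on Windows):

```r
# Hypothetical implementation of scanForRowStarts(): returns the file
# offset of the first character of every line, found by alternating
# seek() (to query the position) and readLines(con, n = 1).
scanForRowStarts <- function(con) {
  seek(con, where = 0, origin = "start", rw = "r")   # rewind to the start
  starts <- numeric(0)
  repeat {
    pos <- seek(con, rw = "r")                       # offset before reading
    if (length(readLines(con, n = 1)) == 0) break    # end of file reached
    starts <- c(starts, pos)
  }
  starts
}

# Made-up example data: 6 rows of 3 comma-separated numbers
pathname <- tempfile(fileext = ".csv")
writeLines(sprintf("%d,%d,%d", 1:6, 11:16, 21:26), pathname)

con <- file(pathname, open = "r")
rowStarts <- scanForRowStarts(con)

# Jump straight to row 4 and read two rows from there
seek(con, where = rowStarts[4], origin = "start", rw = "r")
data <- scan(con, what = numeric(), sep = ",", nlines = 2, quiet = TRUE)
close(con)
```

The index has to be built only once per file; after that, every chunk can be reached with a single seek() instead of a skip= that grows with the row number.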

Cheers

/Henrik

>
> Is there a solution to this problem?
>
> Your help will be highly appreciated!
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Sat Apr 07 02:33:06 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 06 Apr 2007 - 17:31:19 GMT.
