RE: [Rd] Slow IO: was [R] naive question

From: Vadim Ogranovich <vograno_at_evafunds.com>
Date: Thu 01 Jul 2004 - 06:13:18 EST


> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard@biostat.ku.dk]
> Sent: Wednesday, June 30, 2004 3:10 AM
> To: Vadim Ogranovich
> Cc: r-devel@stat.math.ethz.ch
> Subject: Re: [Rd] Slow IO: was [R] naive question
>
> "Vadim Ogranovich" <vograno@evafunds.com> writes:
>
> > ...
> > I can see at least two main reasons why R's IO is so slow (I didn't
> > profile this though):
> > A) it reads from a connection char-by-char as opposed to doing a
> > buffered read. Reading each char requires a call to scanchar(), which
> > then calls Rconn_fgetc() (with some non-trivial overhead).
> > Rconn_fgetc(), for its part, is defined somewhere else (not in
> > scan.c) and therefore the call cannot be inlined, etc.
> > B) mkChar, which is used very extensively, is too slow.
> > ...
>
> Do you have some hard data on the relative importance of the
> above issues?

Well, here is a little analysis which sheds some light. I have a file, foo, 154M uncompressed, containing about 3.8M lines.

01/02% ls -l foo*

-rw-rw-r--    1 vograno  man      153797513 Jun 30 11:56 foo
-rw-rw-r--    1 vograno  man      21518547 Jun 30 11:56 foo.gz

# reading the files using standard UNIX utils takes no time
01/02% time cat foo > /dev/null
0.030u 0.110s 0:00.80 17.5% 0+0k 0+0io 124pf+0w
01/02% time zcat foo.gz > /dev/null
1.210u 0.030s 0:01.24 100.0% 0+0k 0+0io 90pf+0w

# compute exact line count
01/02% zcat foo.gz | wc
3794929 3794929 153797513

# now we fire up R-1.8.1
# we will experiment with the gzipped copy, since we've seen that the overhead of decompression is trivial
> nlines <- 3794929

# this exercises scanchar(), but not mkChar(), see scan() in scan.c
> system.time(scan(gzfile("foo.gz", open="r"), what="character", skip = nlines - 1))
Read 1 items
[1] 67.83 0.01 68.04 0.00 0.00

# this exercises both scanchar() and mkChar()
> system.time(readLines(gzfile("foo.gz", open="r"), n = nlines))
[1] 110.61 0.83 112.44 0.00 0.00

It seems that scanchar() and mkChar() have comparable overheads in this case.

> ... This might be a changing balance, but I
> think you're more on the mark with the mkChar issue. (Then
> again, it is quite a bit easier to come up with buffering
> designs for Rconn_fgetc than it is to redefine STRSXP...)
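
Indeed, buffering Rconn_fgetc() looks like the easier half. For illustration, the design could be as simple as the following (a minimal sketch in plain C against a FILE*; the names BufReader and buf_getc are mine, and the real thing would sit behind the Rconnection interface rather than stdio):

#include <stdio.h>

#define BUF_SIZE 8192

typedef struct {
    FILE *fp;
    unsigned char buf[BUF_SIZE];
    size_t pos;                 /* next unread byte */
    size_t len;                 /* bytes currently in buf */
} BufReader;

static void buf_init(BufReader *r, FILE *fp)
{
    r->fp = fp;
    r->pos = r->len = 0;
}

/* Refilling is the only place that touches the underlying stream;
   the common path is one bounds check and one array access. */
static int buf_getc(BufReader *r)
{
    if (r->pos >= r->len) {
        r->len = fread(r->buf, 1, BUF_SIZE, r->fp);
        r->pos = 0;
        if (r->len == 0) return EOF;
    }
    return r->buf[r->pos++];
}

/* toy driver: count lines on stdin */
int main(void)
{
    BufReader r;
    long lines = 0;
    int c;
    buf_init(&r, stdin);
    while ((c = buf_getc(&r)) != EOF)
        if (c == '\n') lines++;
    printf("%ld lines\n", lines);
    return 0;
}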

First of all, I agree that redefining STRSXP is not easy, but it has the potential to considerably speed up R as a whole, since name propagation would work faster.
As to mkChar() in scan(), there are a few tricks that can help. Say we have a CSV file that contains categorical and numerical data. Here is what we can do to minimize the number of calls to mkChar:
* for numerical columns there is no need to call mkChar() at all; the field buffer can be converted to double directly.
* for categorical columns, keep a hash table of the strings seen so far, so that mkChar() is called only once per distinct level rather than once per row (a sketch follows below).

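Here is a minimal sketch of the interning trick. The fixed-size open-addressing table is a toy of my own (no resizing, so real code would need to grow it); only mkChar() and R_PreserveObject() are actual R API:

#include <string.h>
#include <Rinternals.h>

#define TAB_SIZE 4096                 /* power of two, toy capacity */

static const char *keys[TAB_SIZE];
static SEXP        vals[TAB_SIZE];

static unsigned str_hash(const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h;
}

static SEXP cached_mkChar(const char *field)
{
    unsigned i = str_hash(field) & (TAB_SIZE - 1);
    while (keys[i] != NULL) {
        if (strcmp(keys[i], field) == 0)
            return vals[i];           /* seen before: no mkChar call */
        i = (i + 1) & (TAB_SIZE - 1); /* linear probe */
    }
    keys[i] = strdup(field);          /* first occurrence of this level */
    vals[i] = mkChar(field);
    R_PreserveObject(vals[i]);        /* keep the CHARSXP alive across GCs */
    return vals[i];
}
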
And a final observation while we are on the subject of scan(): I've found it more convenient to convert data column-by-column rather than row-by-row. When you do it column-by-column you:
* figure out the type of the column only once. Ditto for the destination vector.
* maintain only one hash table for the current column, not for all columns at once.
A small sketch of this shape is below.
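
(Illustrative names only; the step that hands the converted column back to R is elided.)

#include <stdio.h>
#include <stdlib.h>

/* Decide the column type once: numeric iff every cell parses fully. */
static int column_is_numeric(char **cells, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char *end;
        strtod(cells[i], &end);
        if (end == cells[i] || *end != '\0')
            return 0;
    }
    return 1;
}

static void convert_column(char **cells, size_t n)
{
    if (column_is_numeric(cells, n)) {
        /* destination vector allocated once, per column */
        double *out = malloc(n * sizeof *out);
        for (size_t i = 0; i < n; i++)
            out[i] = strtod(cells[i], NULL);
        for (size_t i = 0; i < n; i++)
            printf("%g ", out[i]);
        putchar('\n');
        free(out);
    } else {
        /* categorical: build one interning table for this column only
           (e.g. the cached_mkChar() sketch above) and discard it when
           the column is done */
        puts("categorical column");
    }
}

int main(void)
{
    char *numbers[] = { "1.5", "2", "-3e4" };
    char *labels[]  = { "red", "green", "red" };
    convert_column(numbers, 3);
    convert_column(labels, 3);
    return 0;
}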

Thanks,
Vadim


