From: Vadim Ogranovich <vograno_at_evafunds.com>

Date: Fri 02 Jul 2004 - 11:49:49 EST

Richard,

Thank you for the analysis. I don't think there is an inconsistency
between the factor of 4 you've found in your example and 20 - 50 I found
in my data. I guess the major cause of the difference lies with the
structure of your data set. Specifically, your test data set differs
from mine in two respects:

* you have fewer lines, but each line contains many more fields (12500 *
800 in your case and 3.8M * 10 in mine)

* all of your data fields are doubles, not strings. I have a mixture of
doubles and strings.

I posted a more technical message to r-devel in which I discussed
possible reasons for the IO slowness. One of them is that R is slow at
making strings. So if you read your data as strings, with
colClasses=rep("character", 800), I'd guess you will see very different
timings. Even simply reshaping your matrix, say to (12500*80) rows by 10
columns, should make the timing considerably worse. Please let me know
the results if you try either of these.
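For anyone who wants to try the two suggestions above without a 100MB file, here is a minimal sketch on a small synthetic data set. The sizes (2000 rows by 50 columns), the temporary file names and the helper read_as() are assumptions made for illustration only; they are not the files or code used in this thread, and timings at this scale need not match those on the full-size data.

```r
# Small synthetic stand-in for the 12500 x 800 file (sizes are assumptions,
# chosen so that the sketch runs in a few seconds).
set.seed(1)
n_row <- 2000
n_col <- 50
m <- matrix(sample.int(999999999, n_row * n_col, replace = TRUE), nrow = n_row)

wide_path <- tempfile(fileext = ".dat")
write.table(m, wide_path, row.names = FALSE, col.names = FALSE, quote = FALSE)

# Hypothetical helper: read a file with every column forced to one class,
# using the same "informed" read.table options as in the quoted experiment.
read_as <- function(path, cls, rows, cols) {
  read.table(path, header = FALSE, quote = "", row.names = NULL,
             colClasses = rep(cls, cols), nrows = rows, comment.char = "")
}

# Suggestion 1: same file, read as doubles and then as strings.
t_num <- system.time(as_num <- read_as(wide_path, "numeric", n_row, n_col))
t_chr <- system.time(as_chr <- read_as(wide_path, "character", n_row, n_col))

# Suggestion 2: same values reshaped to many rows by 10 columns.
m_long <- matrix(m, ncol = 10)
long_path <- tempfile(fileext = ".dat")
write.table(m_long, long_path, row.names = FALSE, col.names = FALSE,
            quote = FALSE)
t_long <- system.time(
  as_long <- read_as(long_path, "numeric", nrow(m_long), 10)
)

cat("wide, numeric:   ", t_num[["elapsed"]], "s elapsed\n")
cat("wide, character: ", t_chr[["elapsed"]], "s elapsed\n")
cat("long, numeric:   ", t_long[["elapsed"]], "s elapsed\n")
```

Raising n_row and n_col towards the real sizes should show whether the string and many-rows cases separate from the all-doubles case in the way I guess they will.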

In my message to r-devel you may also find some timings that support my estimates.

Thanks,

Vadim

**> As part of a continuing thread on the cost of loading large
**> amounts of data into R,
**>
**> "Vadim Ogranovich" <vograno@evafunds.com> wrote:
**> R's IO is indeed 20 - 50 times slower than that of
**> equivalent C code
**> no matter what you do, which has been a pain for some of us.
**>
**> I wondered to myself just how bad R is at reading, when it is
**> given a fair chance. So I performed an experiment.
**> My machine (according to "Workstation Info") is a SunBlade
**> 100 with 640MB of physical memory running SunOS 5.9 Generic,
**> according to fpversion this is an Ultra2e with the CPU clock
**> running at 500MHz and the main memory clock running at 84MHz
**> (wow, slow memory). R.version is
**> platform sparc-sun-solaris2.9
**> arch     sparc
**> os       solaris2.9
**> system   sparc, solaris2.9
**> status
**> major    1
**> minor    9.0
**> year     2004
**> month    04
**> day      12
**> language R
**> and although this is a 64-bit machine, it's a 32-bit
**> installation of R.
**>
**> The experiment was this:
**> (1) I wrote a C program that generated 12500 rows of 800 columns; the
**> numbers were integers 0..999,999,999 generated using drand48().
**> These numbers were written using printf(). It is possible to do
**> quite a bit better by avoiding printf(), but that would ruin the
**> spirit of the comparison, which is to see what can be done with
**> *straightforward* code using *existing* library functions.
**>
**> 21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.
**>
**> The sizes were chosen to get 100MB; the actual size was
**> 12500 (lines) 10000000 (words) 100012500 (bytes)
**>
**> (2) I wrote a C program that read these numbers using scanf("%d"); it
**> "knew" there were 800 numbers per row and 12500 rows in all.
**> Again, it is possible to do better by avoiding scanf(), but the
**> point is to look at *straightforward* code.
**>
**> 18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.
**>
**> (3) I started R, played around a bit doing other things, then
**> issued this command:
**>
**> > system.time(xx <- read.table("/tmp/big.dat",
**> + header=FALSE, quote="",
**> + row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
**> + comment.char=""))
**>
**> So how long _did_ it take to read 100MB on this machine?
**>
**> 71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
**>
**> The result: the R/C ratio was less than 4, whether you
**> measure cpu time or real time. It certainly wasn't anywhere
**> near 20-50 times slower.
**>
**> Of course, *binary* I/O in C *would* be quite a bit faster:
**> (1') generate same integers but write a row at a time using fwrite():
**> 5 seconds cpu, 25 seconds real; 40 MB.
**>
**> (2') read same integers a row at a time using fread()
**> 0.26 seconds cpu, 1 second real.
**>
**> This would appear to more than justify "20-50 times slower",
**> but reading binary data and reading data in a textual
**> representation are different things; "less than 4 times
**> slower" is the fairer measure. However, it does emphasise
**> the usefulness of problem-specific bulk reading techniques.
**>
**> I thought I'd give you another R measurement:
**> > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
**> But I got sick of waiting for it, and killed it after 843 cpu seconds,
**> 3075 real seconds. Without knowing how far it had got, one
**> can say no more than that this is at least 10 times slower
**> than the more informed call to read.table.
**>
**> What this tells me is that if you know something about the
**> data that you _could_ tell read.table about, you do yourself
**> no favour by keeping read.table in the dark. All those
**> options are there for a reason, and it *will* pay to use them.
**>
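The text-versus-binary comparison in (1') and (2') can also be tried from within R. The following is only a sketch under stated assumptions: it uses writeBin() and readBin() as the R-side analogue of fwrite() and fread(), a small 1000 by 80 integer matrix, and temporary files; none of this is the C code or data from the experiment quoted above.

```r
# Hypothetical small data set (sizes are assumptions for a quick run).
set.seed(1)
n_row <- 1000
n_col <- 80
m <- matrix(sample.int(999999999, n_row * n_col, replace = TRUE), nrow = n_row)

txt_path <- tempfile(fileext = ".dat")
bin_path <- tempfile(fileext = ".bin")

# Text representation, one row of the matrix per line.
write.table(m, txt_path, row.names = FALSE, col.names = FALSE, quote = FALSE)

# Binary representation: 4-byte integers written row by row (row-major),
# which is why the matrix is transposed before flattening.
con <- file(bin_path, "wb")
writeBin(as.vector(t(m)), con, size = 4L)
close(con)

# Read the text file with the column types and row count supplied.
t_txt <- system.time(
  from_txt <- as.matrix(read.table(txt_path, header = FALSE, quote = "",
                                   colClasses = rep("numeric", n_col),
                                   nrows = n_row, comment.char = ""))
)

# Read the binary file in one bulk call and restore the row-major layout.
t_bin <- system.time({
  con <- file(bin_path, "rb")
  v <- readBin(con, what = "integer", n = n_row * n_col, size = 4L)
  close(con)
  from_bin <- matrix(v, nrow = n_row, byrow = TRUE)
})

cat("text read:   ", t_txt[["elapsed"]], "s elapsed\n")
cat("binary read: ", t_bin[["elapsed"]], "s elapsed\n")
```

The binary file is not portable across byte orders or integer sizes unless those are fixed explicitly, which is part of why the two kinds of reading are not directly comparable.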
