RE: [R] naive question

From: Richard A. O'Keefe <ok_at_cs.otago.ac.nz>
Date: Fri 02 Jul 2004 - 10:22:21 EST


As part of a continuing thread on the cost of loading large amounts of data into R,

"Vadim Ogranovich" <vograno@evafunds.com> wrote:

	R's IO is indeed 20 - 50 times slower than that of equivalent C code
	no matter what you do, which has been a pain for some of us.

I wondered to myself just how bad R is at reading, when it is given a fair chance. So I performed an experiment.

My machine (according to "Workstation Info") is a SunBlade 100 with 640MB of physical memory running SunOS 5.9 Generic; according to fpversion, this is an Ultra2e with the CPU clock running at 500MHz and the main memory clock running at 84MHz (wow, slow memory). R.version is

platform sparc-sun-solaris2.9
arch     sparc               
os       solaris2.9          
system   sparc, solaris2.9   
status                       
major    1                   
minor    9.0                 
year     2004                
month    04                  
day      12                  
language R                   

and although this is a 64-bit machine, it's a 32-bit installation of R.

The experiment was this:
(1) I wrote a C program that generated 12500 rows of 800 columns; the
    numbers were integers 0..999,999,999 generated using drand48().
    These numbers were written using printf(). It is possible to do
    quite a bit better by avoiding printf(), but that would ruin the
    spirit of the comparison, which is to see what can be done with
    *straightforward* code using *existing* library functions.

    21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.

    The sizes were chosen to get 100MB; the actual size was
    12500 (lines) 10000000 (words) 100012500 (bytes)

(2) I wrote a C program that read these numbers using scanf("%d"); it
    "knew" there were 800 numbers per row and 12500 rows in all.
    Again, it is possible to do better by avoiding scanf(), but the
    point is to look at *straightforward* code. (A rough sketch of
    both programs appears after step (3) below.)

    18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.

(3) I started R, played around a bit doing other things, then issued
    this command:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
    + row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
    + comment.char=""))

    So how long _did_ it take to read 100MB on this machine?

    71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
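
For what it's worth, here is a rough sketch of what the two C programs in
steps (1) and (2) might have looked like. This is a reconstruction from the
description above, not the original code; the 12500 x 800 shape, the
0..999,999,999 range, and the use of drand48(), printf() and scanf("%d")
come from the text, while the write/read command-line switch and the file
name are just for illustration.

    /* Sketch of steps (1) and (2): straightforward text I/O using only
     * printf() and scanf().  A reconstruction, not the original program. */
    #define _XOPEN_SOURCE 500            /* for drand48() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ROWS 12500
    #define COLS 800

    int main(int argc, char **argv)
    {
        if (argc == 2 && strcmp(argv[1], "write") == 0) {
            /* Step (1): 800 integers in 0..999,999,999 per line, 12500 lines. */
            for (int r = 0; r < ROWS; r++) {
                for (int c = 0; c < COLS; c++)
                    printf(c == 0 ? "%ld" : " %ld", (long)(drand48() * 1e9));
                printf("\n");
            }
        } else if (argc == 2 && strcmp(argv[1], "read") == 0) {
            /* Step (2): read them back with scanf("%d"), knowing the shape. */
            int x;
            for (long i = 0; i < (long)ROWS * COLS; i++)
                if (scanf("%d", &x) != 1) {
                    fprintf(stderr, "read failed at item %ld\n", i);
                    return EXIT_FAILURE;
                }
        } else {
            fprintf(stderr, "usage: %s write|read\n", argv[0]);
            return EXIT_FAILURE;
        }
        return 0;
    }

Something like "./a.out write > /tmp/big.dat" followed by
"./a.out read < /tmp/big.dat" reproduces the shape of the test; the timings
will of course depend on the machine and compiler.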

The result: the R/C ratio was less than 4, whether you measure cpu time (73.5/19.0, about 3.9) or real time (353/100, about 3.5). It certainly wasn't anywhere near 20-50 times slower.

Of course, *binary* I/O in C *would* be quite a bit faster (a rough sketch of both steps follows the timings):
(1') generate same integers but write a row at a time using fwrite():

     5 seconds cpu, 25 seconds real; 40 MB.

(2') read same integers a row at a time using fread()

     0.26 seconds cpu, 1 second real.
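
For completeness, here is a similar sketch of the binary variant in (1')
and (2'). The row-at-a-time fwrite()/fread() calls come from the
description above; the file name and the choice of 4-byte ints are
assumptions (800 ints * 4 bytes * 12500 rows is the 40 MB mentioned).

    /* Sketch of steps (1') and (2'): the same data as raw binary, written
     * and read one 800-int row at a time.  A reconstruction, not the
     * original program. */
    #define _XOPEN_SOURCE 500            /* for drand48() */
    #include <stdio.h>
    #include <stdlib.h>

    #define ROWS 12500
    #define COLS 800

    int main(void)
    {
        int row[COLS];

        /* (1') generate a row at a time and write it with fwrite(). */
        FILE *out = fopen("/tmp/big.bin", "wb");
        if (out == NULL) { perror("fopen"); return EXIT_FAILURE; }
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++)
                row[c] = (int)(drand48() * 1e9);    /* 0..999,999,999 */
            if (fwrite(row, sizeof(int), COLS, out) != COLS) {
                perror("fwrite"); return EXIT_FAILURE;
            }
        }
        fclose(out);

        /* (2') read it back a row at a time with fread(). */
        FILE *in = fopen("/tmp/big.bin", "rb");
        if (in == NULL) { perror("fopen"); return EXIT_FAILURE; }
        for (int r = 0; r < ROWS; r++)
            if (fread(row, sizeof(int), COLS, in) != COLS) {
                perror("fread"); return EXIT_FAILURE;
            }
        fclose(in);
        return 0;
    }

The point is not the exact code, but that a fixed binary layout lets each
row move in a single library call, which is why (2') is so much faster than
any textual parse.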

This would appear to more than justify "20-50 times slower", but reading binary data and reading data in a textual representation are different things; "less than 4 times slower" is the fairer measure. However, it does emphasise the usefulness of problem-specific bulk reading techniques.

I thought I'd give you another R measurement:

    > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))

But I got sick of waiting for it, and killed it after 843 cpu seconds, 3075 real seconds. Without knowing how far it had got, one can say no more than that this is at least 10 times slower than the more informed call to read.table.

What this tells me is that if you know something about the data that you _could_ tell read.table about, you do yourself no favour by keeping read.table in the dark. All those options are there for a reason, and it *will* pay to use them.


