Re: [R] FW: Large datasets in R

From: François Pinard <pinard_at_iro.umontreal.ca>
Date: Wed 19 Jul 2006 - 08:56:26 EST

[Thomas Lumley]

>People have used R in this way, storing data in a database and reading it
>as required. There are also some efforts to provide facilities to support
>this sort of programming (such as the current project funded by Google
>Summer of Code: http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html).

Interesting project indeed! However, if R needs more swapping because arrays do not all fit in physical memory, crudely replacing swapping with database accesses is not necessarily going to buy a drastic speed improvement: the paging merely gets done in user space instead of in the kernel.
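To make the idea concrete, here is a minimal sketch of that kind of chunked processing through a database, using the DBI/RSQLite interface (the file name, table, and chunk size are my own assumptions for illustration, not anything from the project mentioned above):

## Compute the mean of a column in a table too big for memory,
## fetching it in fixed-size chunks so only one chunk is resident.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")   # hypothetical file
res <- dbSendQuery(con, "SELECT x FROM measurements")    # hypothetical table

total <- 0
count <- 0
while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = 10000)    # at most 10000 rows at a time
    total <- total + sum(chunk$x)
    count <- count + nrow(chunk)
}
dbClearResult(res)
dbDisconnect(con)

total / count

Only one chunk is ever resident, but every row still crosses the database interface, and that per-row cost is precisely the user-space paging overhead described above.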

Long ago, while working on CDC mainframes (astonishing at the time, but tiny by today's standards), there was a program able to invert very big matrices, or run the simplex method on them. I do not remember the program's name, and I never studied it more than superficially (I was in computer support for researchers, not a researcher myself). The program was documented as being extremely careful about organising accesses to rows and columns (or parts thereof) so that real memory was put to best use. In other words, at the core of this program was a paging system that was highly specialised and cooperated with the problems the program was meant to solve.

However, the source of this program was just plain huge (from memory, let's say three or four times the size of the optimising FORTRAN compiler, which I knew better and which was already an impressive algorithmic undertaking). So, right or wrong, a prejudice stuck solidly in me at the time: if nothing else, that handling big arrays the right way, speed-wise, ought to be very difficult.
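I never saw how that program actually organised its accesses, but the general blocking idea behind such careful memory use is easy to sketch. Here is a toy R reconstruction of my own (not the program's method): a blocked matrix multiplication that touches only small submatrices at a time, which is what would let an out-of-core solver keep its working set in real memory.

## Multiply A and B block by block; at any moment only three
## bs-by-bs submatrices are being worked on.  In a real out-of-core
## solver the blocks would be read from disk rather than subset
## from in-memory matrices.
block_multiply <- function(A, B, bs = 256) {
    stopifnot(ncol(A) == nrow(B))
    C <- matrix(0, nrow(A), ncol(B))
    for (i in seq(1, nrow(A), by = bs)) {
        ii <- i:min(i + bs - 1, nrow(A))
        for (j in seq(1, ncol(B), by = bs)) {
            jj <- j:min(j + bs - 1, ncol(B))
            for (k in seq(1, ncol(A), by = bs)) {
                kk <- k:min(k + bs - 1, ncol(A))
                ## accumulate the contribution of one block pair
                C[ii, jj] <- C[ii, jj] +
                    A[ii, kk, drop = FALSE] %*% B[kk, jj, drop = FALSE]
            }
        }
    }
    C
}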

>One reason there isn't more of this is that relying on Moore's Law has
>worked very well over the years.

On the other hand, the computational needs of scientific problems grow fairly quickly to match our ability to solve them. Take weather forecasting, for example. 3-D geographical grids are never fine enough for the resolution meteorologists would like to get, and the time required for each prediction step grows very rapidly as the grid is refined, while precision improves by not so much. By merely tuning a few parameters, these people can easily pump nearly all the available cycles out of the supercomputers given to them, and they do so without hesitation. Moore's Law will never succeed at calming their hunger! :-).
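To see how rapidly, here is a back-of-the-envelope sketch, assuming an explicit time-stepping scheme where the CFL stability condition ties the time step to the grid spacing:

## Halving the grid spacing in a 3-D model multiplies the number of
## grid points by 2^3 = 8, and the CFL condition halves the time step
## as well, so one forecast costs roughly 16 times as much.
refine <- 2                      # refinement factor in each dimension
points_factor <- refine^3        # 3-D grid points: 8
timestep_factor <- refine        # CFL: time step shrinks with spacing
points_factor * timestep_factor  # total cost factor: 16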

-- 
François Pinard   http://pinard.progiciels-bpi.ca

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.