Re: [R] FW: Large datasets in R

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Wed 19 Jul 2006 - 01:34:02 EST

On Tue, 18 Jul 2006, Ritwik Sinha wrote:

> Hi,
>
> I have a related question. How differently do other statistical
> software packages handle large data?
>
> The original post claims that 350 MB is fine on Stata. Someone
> suggested S-Plus. I have heard people say that SAS can handle large
> data sets. Why can others do it while R seems to have a problem? Don't
> these packages load the data into RAM?
>

Stata does load the data into RAM and does have limits for the same reason that R does. However, Stata has a less flexible representation of its data (basically one rectangular dataset) and so it can handle somewhat larger data sets for any given memory size. For example, even with 512Mb of memory a 350Mb data set might be usable in Stata, and with 1Gb it certainly would be. Stata is also faster for a given memory load, apparently because of its simpler language design [some evidence for this is that the recent language additions to support flexible graphics run rather more slowly than, e.g., lattice in R].

The other approach is to write the estimation routines so that only part of the data need be in memory at a given time. *Some* procedures in SAS and SPSS work this way, and this is the idea of the S-PLUS 7.0 system for handling large data sets. This approach requires the programmer to handle the reading of sections of data into memory, something that can only be automated to a limited extent.
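
As a rough illustration of the chunked approach described above, here is a minimal sketch in Python (chosen only for illustration; it is not how SAS, SPSS or S-PLUS implement it). It fits a one-predictor least-squares line by accumulating sums over fixed-size chunks, so only one chunk is held in memory at a time. The function name, the chunk size and the two-column input layout are all assumptions made for the example, and raw running sums like these can lose precision on badly scaled data.

```python
import csv
import io
from itertools import islice


def chunked_simple_regression(rows, chunk_size=1000):
    """Fit y = intercept + slope * x from an iterator of (x, y) rows.

    Hypothetical helper: only `chunk_size` rows are materialised at once;
    everything else is summarised in five running totals.
    """
    n = sx = sy = sxx = sxy = 0.0
    reader = iter(rows)
    while True:
        chunk = list(islice(reader, chunk_size))  # the only data held in memory
        if not chunk:
            break
        for x, y in chunk:
            x, y = float(x), float(y)
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Closed-form least-squares solution from the accumulated sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Small in-memory stand-in for a large file on disk: y = 2x + 1 for x = 0..9.
demo = io.StringIO("".join(f"{i},{2 * i + 1}\n" for i in range(10)))
intercept, slope = chunked_simple_regression(csv.reader(demo), chunk_size=3)
```

The part that resists automation is visible here: the estimator had to be rewritten in terms of quantities that can be accumulated chunk by chunk, which is easy for least squares and much less so for procedures that need several passes or random access to the data.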

People have used R in this way, storing data in a database and reading it as required. There are also some efforts to provide facilities to support this sort of programming (such as the current project funded by Google Summer of Code: http://tolstoy.newcastle.edu.au/R/devel/06/05/5525.html). One reason there isn't more of this is that relying on Moore's Law has worked very well over the years.
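
The database variant can be sketched in the same hedged way, again in Python rather than R purely for illustration. The data live in a SQLite table and are pulled through a cursor in batches, so the full column never has to fit in memory. The table name `obs`, the column name `value`, the helper name and the batch size are assumptions for the example, and the table and column names are interpolated into the query on the assumption that they are trusted.

```python
import sqlite3


def mean_from_database(conn, table, column, batch_size=1000):
    """Compute the mean of one column, fetching `batch_size` rows at a time.

    Hypothetical helper: `table` and `column` are assumed to be trusted
    identifiers, since they are interpolated directly into the SQL text.
    """
    cur = conn.execute(f"SELECT {column} FROM {table}")
    total = 0.0
    count = 0
    while True:
        batch = cur.fetchmany(batch_size)  # read only the next section of data
        if not batch:
            break
        for (value,) in batch:
            total += value
            count += 1
    return total / count


# In-memory database standing in for a large on-disk one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (value REAL)")
conn.executemany("INSERT INTO obs (value) VALUES (?)",
                 [(float(i),) for i in range(1, 101)])
result = mean_from_database(conn, "obs", "value", batch_size=7)
conn.close()
```

For a summary this simple the database could compute the mean itself with an aggregate query; the batch loop is shown because it is the pattern that extends to estimators the database does not provide.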

          -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley@u.washington.edu	University of Washington, Seattle

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed Jul 19 01:39:32 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 19 Jul 2006 - 10:18:34 EST.
