Re: [R] Processing large datasets

From: Marc Schwartz <>
Date: Wed, 25 May 2011 09:46:57 -0500

Take a look at the High-Performance and Parallel Computing with R CRAN Task View:

specifically at the section labeled "Large memory and out-of-memory data".

Some R functionality has been implemented specifically to enable out-of-memory operations, but not all of it has.
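For example, the bigmemory package (one of those listed in that Task View section) keeps a matrix file-backed on disk and only pulls the rows you index into RAM. A minimal sketch, assuming a numeric-only CSV; the file and column names here are hypothetical:

```r
# Sketch: out-of-memory analysis with a file-backed big.matrix.
# Assumes "quotes.csv" is numeric-only; "size"/"price" columns are hypothetical.
library(bigmemory)

x <- read.big.matrix("quotes.csv", header = TRUE, sep = ",",
                     type = "double",
                     backingfile = "quotes.bin",
                     descriptorfile = "quotes.desc")

# The full dataset stays on disk; only the indexed subset is materialized.
mean(x[x[, "size"] > 0, "price"])
```

The ff package offers a similar disk-backed approach for data.frame-like objects via read.csv.ffdf().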

I believe that Revolution's commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase accessible RAM; however, there will still be object size limitations, predicated upon the fact that R uses 32-bit signed integers for indexing into objects. See ?"Memory-limits" for more information.
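As a quick illustration of that bound (the numbers follow directly from the 32-bit signed indexing described in ?"Memory-limits"):

```r
# R's per-object length limit: indices are 32-bit signed integers,
# so no single object can hold more than 2^31 - 1 elements.
.Machine$integer.max          # 2147483647, i.e. 2^31 - 1

# A numeric (double) vector at that maximum length alone would need ~16 GiB:
8 * (2^31 - 1) / 1024^3       # 8 bytes per double * max length, in GiB
```

So even with abundant RAM on a 64-bit system, a single vector or matrix dimension cannot exceed that element count.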

HTH, Marc Schwartz

On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:

> Thanks Jonathan. 
> I'm already using RMySQL to load data for a couple of days. 
> I wanted to know what the relevant R capabilities are if I want to process much bigger tables. 
> R always reads the whole set into memory, and this might be a limitation in the case of big tables, correct? 
> Doesn't it use temporary files or something similar to deal with such amounts of data? 
> As an example, I know that SAS handles sas7bdat files up to 1TB on a box with 76GB of memory, without noticeable issues. 
> --Roman 
> ----- Original Message -----
>> In cases where I have to parse through large datasets that will not
>> fit into R's memory, I will grab relevant data using SQL and then
>> analyze said data using R. There are several packages designed to do
>> this, like [1] and [2] below, that allow you to query a database
>> using SQL and end up with that data in an R data.frame.
>> [1]
>> [2]
>> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
>> <> wrote:

>>> Hi R list,
>>> I'm new to R software, so I'd like to ask about its capabilities.
>>> What I'm looking to do is to run some statistical tests on quite big
>>> tables which are aggregated quotes from a market feed.
>>> This is a typical set of data.
>>> Each day contains millions of records (up to 10 million, non-filtered).
>>> 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
>>> 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
>>> 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
>>> I'll need to filter it out first based on some criteria.
>>> Since I keep it in a MySQL database, this can be done with a query.
>>> Not super efficient; I've checked that already.
>>> Then I need to aggregate the dataset into different time frames (time
>>> is represented in ms from midnight, like 35482391).
>>> Again, this can be done with a database query, but I'm not sure which
>>> will be faster.
>>> The aggregated tables are going to be much smaller, like thousands of
>>> rows per observation day.
>>> Then calculate basic statistics: mean, standard deviation, sums, etc.
>>> After the stats are calculated, I need to perform some statistical
>>> hypothesis tests.
>>> So, my question is: which tool is faster for data aggregation and
>>> filtering on big datasets: MySQL or R?
>>> Thanks,
>>> --Roman N.

Received on Wed 25 May 2011 - 14:51:50 GMT
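To tie the suggestions in this thread together, here is a minimal sketch of the filter-in-MySQL, analyze-in-R workflow being discussed. The database, table, and column names, as well as the credentials, are all hypothetical:

```r
# Sketch: push the heavy filtering to MySQL, then aggregate/summarise in R.
# Everything named here (quotes_db, quotes table, columns) is hypothetical.
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "quotes_db",
                 user = "user", password = "password")

# Only the filtered rows cross into R as a data.frame.
day <- dbGetQuery(con, "
  SELECT ms, price, size
  FROM quotes
  WHERE sym = 'DELL' AND side = 'Bid' AND trade_date = '2011-05-24'")
dbDisconnect(con)

# Aggregate into 1-minute bins (ms is milliseconds from midnight),
# computing per-bin mean and standard deviation of price.
day$bin <- day$ms %/% 60000
agg <- aggregate(price ~ bin, data = day,
                 FUN = function(p) c(mean = mean(p), sd = sd(p)))
```

Once the aggregated table is down to thousands of rows per day, standard R hypothesis-test functions (t.test, wilcox.test, etc.) can be applied to it directly in memory.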


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 25 May 2011 - 15:00:10 GMT.
