Re: [R] Processing large datasets

From: Steve Lianoglou <>
Date: Wed, 25 May 2011 10:00:31 -0400


On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <> wrote:
> Hi R list,
> I'm new to R software, so I'd like to ask about its capabilities.
> What I'm looking to do is to run some statistical tests on quite big
> tables which are aggregated quotes from a market feed.
> This is a typical set of data.
> Each day contains millions of records (up to 10 million unfiltered).
> 2011-05-24      750     Bid     DELL    14130770        400
> 15.4800         BATS    35482391        Y       1       1       0       0
> 2011-05-24      904     Bid     DELL    14130772        300
> 15.4800         BATS    35482391        Y       1       0       0       0
> 2011-05-24      904     Bid     DELL    14130773        135
> 15.4800         BATS    35482391        Y       1       0       0       0
> I'll need to filter it first based on some criteria.
> Since I keep it in a MySQL database, this can be done with a query. It is
> not super efficient; I have checked that already.
> Then I need to aggregate dataset into different time frames (time is
> represented in ms from midnight, like 35482391).
> Again, this can be done through a database query; I'm not sure which will be faster.
> The aggregated tables are going to be much smaller, around thousands of rows per
> observation day.
> Then calculate basic statistics: mean, standard deviation, sums, etc.
> After stats are calculated, I need to perform some statistical
> hypothesis tests.
> So, my question is: which tool is faster for data aggregation and filtering
> on big datasets: MySQL or R?

Why not try a few experiments and see for yourself -- I guess the answer will depend on what exactly you are doing.
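As a rough sketch of what such an experiment might look like on the R side (the data below is synthetic and the column names `symbol`, `ms`, `price`, and `size` are made up for illustration, not taken from your feed), you could time one filter-and-aggregate pass with `system.time()` and compare it against the time the equivalent query takes in MySQL:

```r
# Synthetic stand-in for one day of quotes; column names are hypothetical.
set.seed(1)
n <- 1e5
quotes <- data.frame(
  symbol = sample(c("DELL", "MSFT", "IBM"), n, replace = TRUE),
  ms     = sample.int(86400000L, n, replace = TRUE) - 1L,  # ms from midnight
  price  = round(runif(n, 15, 16), 2),
  size   = sample(seq(100L, 1000L, by = 100L), n, replace = TRUE)
)

# Bucket the millisecond timestamps into 1-minute time frames (0..1439).
quotes$minute <- quotes$ms %/% 60000L

# Time one filter + aggregate pass in base R.
timing <- system.time({
  dell <- quotes[quotes$symbol == "DELL" & quotes$size >= 300L, ]
  agg  <- aggregate(price ~ minute, data = dell, FUN = mean)
})
print(timing)
```

The "elapsed" entry is the one to compare against the database timing. Increase `n` toward your real daily volume to see how the picture changes.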

If your datasets are *really* huge, check out the packages listed under the "Large memory and out-of-memory data" section of the "HighPerformanceComputing" task view on CRAN.
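To give a feel for the out-of-memory idea without committing to any particular package, here is a minimal base R sketch that reads a file in fixed-size chunks and keeps only running totals in memory. The file layout (two tab-separated columns, `ms` and `price`) and the chunk size are assumptions for illustration; the packages in the task view handle this far more conveniently.

```r
# Write a small tab-separated file to stand in for a large export
# (hypothetical two-column layout: ms from midnight, price).
set.seed(1)
n <- 10000
full <- data.frame(ms    = sample.int(86400000L, n, replace = TRUE) - 1L,
                   price = round(runif(n, 15, 16), 2))
path <- tempfile(fileext = ".tsv")
write.table(full, path, sep = "\t", row.names = FALSE,
            col.names = FALSE, quote = FALSE)

# Read the file chunk by chunk from an open connection, so only
# chunk_rows rows are held in memory at any one time.
con <- file(path, open = "r")
chunk_rows <- 1000L
total <- 0
count <- 0
repeat {
  chunk <- tryCatch(
    read.table(con, sep = "\t", nrows = chunk_rows,
               col.names = c("ms", "price")),
    error = function(e) NULL  # read.table errors once no lines are left
  )
  if (is.null(chunk)) break
  total <- total + sum(chunk$price)
  count <- count + nrow(chunk)
  if (nrow(chunk) < chunk_rows) break  # short chunk means end of file
}
close(con)

running_mean <- total / count
print(running_mean)
```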

Also, if you find yourself needing to do lots of "grouping/summarizing" type of calculations over large data frame-like objects, you might want to check out the data.table package.
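A small sketch of that kind of grouping with data.table, assuming a toy table whose column names (`symbol`, `ms`, `price`, `size`) and filter threshold are invented for illustration, would filter the rows, bucket the millisecond timestamps into 1-minute frames, and compute the summary statistics per group in one call:

```r
library(data.table)

# Toy quotes table; column names are illustrative, not the real feed layout.
quotes <- data.table(
  symbol = c("DELL", "DELL", "DELL", "DELL"),
  ms     = c(35482391L, 35482391L, 35490000L, 35545000L),  # ms from midnight
  price  = c(15.48, 15.48, 15.50, 15.52),
  size   = c(400L, 300L, 135L, 200L)
)

# Assign each quote to a 1-minute bucket (minutes since midnight).
quotes[, minute := ms %/% 60000L]

# Filter on size, then summarize per symbol and minute.
stats <- quotes[size >= 200L,
                list(n          = .N,
                     mean_price = mean(price),
                     sd_price   = sd(price),
                     total_size = sum(size)),
                by = list(symbol, minute)]
print(stats)
```

Note that `sd()` returns `NA` for any bucket that contains a single quote, so very fine time frames may need a minimum-count check before running tests on the result.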

Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University

Received on Wed 25 May 2011 - 14:03:31 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 25 May 2011 - 14:40:10 GMT.
