[R] Processing large datasets

From: Roman Naumenko <roman_at_bestroman.com>
Date: Wed, 25 May 2011 00:29:03 -0400


Hi R list,

I'm new to R, so I'd like to ask about its capabilities. I want to run some statistical tests on fairly large tables of aggregated quotes from a market feed.

This is a typical sample of the data.
Each day contains millions of records (up to 10 million unfiltered).

2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0

I'll need to filter it first based on some criteria. Since I keep it in a MySQL database, this can be done with a query. I have already checked, and that is not very efficient.

Then I need to aggregate the dataset into different time frames (time is represented in ms since midnight, e.g. 35482391). Again, this can be done with a database query, but I'm not sure which will be faster. The aggregated tables are going to be much smaller, on the order of thousands of rows per observation day.
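To make the aggregation step concrete, here is a minimal base R sketch of bucketing quotes into time frames. The column names (ms, price, size), the one-minute frame width, and the toy rows are all assumptions made for the example; they are not the real feed layout.

```r
# Toy quotes; column names and values are made up for illustration.
quotes <- data.frame(
  ms    = c(35482391, 35482950, 35541200, 35545010, 35605000),  # ms since midnight
  price = c(15.48, 15.48, 15.49, 15.50, 15.47),
  size  = c(400, 300, 135, 200, 500)
)

frame_ms <- 60 * 1000  # assumed frame width: one minute

# Integer index of the time frame each quote falls into.
quotes$frame <- quotes$ms %/% frame_ms

# Per-frame summaries; tapply() groups by the frame index.
n_quotes   <- tapply(quotes$price, quotes$frame, length)
total_size <- tapply(quotes$size,  quotes$frame, sum)
mean_price <- tapply(quotes$price, quotes$frame, mean)

# Collect the summaries into one row per frame.
agg <- data.frame(
  frame      = as.numeric(names(mean_price)),
  n_quotes   = as.vector(n_quotes),
  total_size = as.vector(total_size),
  mean_price = as.vector(mean_price)
)
print(agg)
```

Changing frame_ms gives other time frames (for example 1000 for one second) without touching the rest of the code.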

Then I need to calculate basic statistics: mean, standard deviation, sums, etc. After the statistics are calculated, I need to perform some statistical hypothesis tests.
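As a rough sketch of what this last step might look like in R: the two vectors below are made-up numbers standing in for per-frame aggregates from two observation days, and Welch's two-sample t-test is used only as an example of a hypothesis test, since the specific tests are not stated here.

```r
# Made-up per-frame mean prices for two observation days (illustration only).
day1 <- c(15.48, 15.50, 15.47, 15.52, 15.49, 15.51)
day2 <- c(15.55, 15.53, 15.58, 15.54, 15.56, 15.57)

# Basic statistics for one day.
stats <- c(mean = mean(day1), sd = sd(day1), sum = sum(day1))
print(stats)

# Example test: Welch's two-sample t-test, the default behaviour of t.test().
res <- t.test(day1, day2)
print(res$p.value)
```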

So, my question is: which tool is faster for aggregating and filtering big datasets, MySQL or R?

Thanks,
--Roman N.



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 25 May 2011 - 06:57:58 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 25 May 2011 - 14:20:09 GMT.
