From: Marc Schwartz <marc_schwartz_at_me.com>

Date: Wed, 25 May 2011 09:46:57 -0500

>>> Hi R list,
>>>
>>> I'm new to R, so I'd like to ask about its capabilities.
>>> What I'm looking to do is to run some statistical tests on quite
>>> big tables, which are aggregated quotes from a market feed.
>>>
>>> This is a typical set of data; each day contains millions of
>>> records (up to 10 million unfiltered).
>>>
>>> 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
>>> 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
>>> 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
>>>
>>> I'll need to filter it first, based on some criteria. Since I keep
>>> the data in a MySQL database, this can be done with a query; it is
>>> not super efficient, as I have already checked.
>>>
>>> Then I need to aggregate the dataset into different time frames
>>> (time is represented in ms from midnight, e.g. 35482391). Again,
>>> this can be done with a database query, but I'm not sure which
>>> would be faster. The aggregated tables will be much smaller, on
>>> the order of thousands of rows per observation day.
>>>
>>> Then I'll calculate basic statistics (means, standard deviations,
>>> sums, etc.), and after those are calculated I need to perform some
>>> statistical hypothesis tests.
>>>
>>> So, my question is: which tool is faster for filtering and
>>> aggregation on big datasets, MySQL or R?
>>>
>>> Thanks,
>>> --Roman N.
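For concreteness, the filtering, time-frame aggregation, and summary statistics described above might look like the following in base R, once a day's records are in a data.frame. This is only a sketch: the column names and the one-minute bucket width are assumptions, since the post does not label its fields.

    # Sketch only: column names are assumed, since the original post
    # does not label its fields.  'quotes' stands for one day of
    # (already filtered) records.
    quotes <- data.frame(
      time_ms = c(35482391, 35482391, 35482391),  # ms after midnight
      size    = c(400, 300, 135),
      price   = c(15.48, 15.48, 15.48)
    )

    # Aggregate into one-minute time frames (60000 ms per bucket).
    quotes$bucket <- quotes$time_ms %/% 60000

    # Basic statistics per bucket: mean price, sd of price, total size.
    stats <- aggregate(price ~ bucket, data = quotes, FUN = mean)
    names(stats)[2]  <- "mean_price"
    stats$sd_price   <- aggregate(price ~ bucket, quotes, sd)$price
    stats$total_size <- aggregate(size  ~ bucket, quotes, sum)$size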


Take a look at the High-Performance and Parallel Computing with R CRAN Task View:

http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

specifically at the section labeled "Large memory and out-of-memory data".

Some specific R features have been implemented in a fashion that enables out-of-memory operations, but not all of them have.
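One example from that section of the Task View: the ff package keeps data in flat files on disk rather than in RAM. A minimal sketch, assuming one day's quotes have been exported to a CSV file (the file name here is hypothetical):

    library(ff)

    # The columns live in memory-mapped files on disk; only small
    # chunks are held in RAM at any one time.
    quotes <- read.csv.ffdf(file = "quotes.csv", header = TRUE)

    # A single column can be materialised into RAM with [] when it is
    # small enough, e.g. to compute summary statistics.
    mean(quotes$price[])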

I believe that Revolution Analytics' commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase the accessible RAM; however, there will still be object-size limitations, predicated upon the fact that R uses 32-bit signed integers for indexing into objects. See ?"Memory-limits" for more information.
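A quick back-of-the-envelope illustration of what that indexing limit implies, in R itself:

    # The largest valid index into a (2011-era) R object:
    .Machine$integer.max
    # [1] 2147483647

    # A numeric (double) vector of that length alone would need ~16 GB:
    .Machine$integer.max * 8 / 1024^3
    # [1] 16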

HTH,
Marc Schwartz

On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:

> Thanks Jonathan.
>
> I'm already using RMySQL to load data for a couple of days.
> I wanted to know what the relevant R capabilities are if I want to
> process much bigger tables.
>
> R always reads the whole dataset into memory, and this might be a
> limitation in the case of big tables, correct? Doesn't it use
> temporary files or something similar to deal with such amounts of
> data?
>
> As an example, I know that SAS handles sas7bdat files of up to 1 TB
> on a box with 76 GB of memory, without noticeable issues.
>
> --Roman
>
> ----- Original Message -----
>
>> In cases where I have to parse through large datasets that will not
>> fit into R's memory, I will grab the relevant data using SQL and
>> then analyze it in R. There are several packages designed to do
>> this, like [1] and [2] below, that allow you to query a database
>> using SQL and end up with that data in an R data.frame.
>>
>> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
>> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
>>
>> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
>> <roman_at_bestroman.com> wrote:

>>> Hi R list,
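As a concrete sketch of the approach suggested in the quoted message above, the heavy filtering and aggregation can be pushed into MySQL so that only the much smaller result crosses into R. The connection details and the table and column names here are all assumptions, not taken from the thread:

    library(RMySQL)

    con <- dbConnect(MySQL(), dbname = "ticks", host = "localhost",
                     user = "roman", password = "...")

    # One row per minute: thousands of rows per day instead of millions.
    agg <- dbGetQuery(con, "
      SELECT FLOOR(time_ms / 60000) AS minute,
             AVG(price)             AS mean_price,
             STD(price)             AS sd_price,
             SUM(size)              AS total_size
      FROM   quotes
      WHERE  symbol = 'DELL'
      GROUP  BY minute")

    dbDisconnect(con)

    # 'agg' is an ordinary data.frame, ready for t.test() and friends.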



Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide (http://www.R-project.org/posting-guide.html) before posting to the list.