Re: [R] Processing large datasets / non-answer, but Q on writing a data frame derivative

From: Mike Marchywka <marchywka_at_hotmail.com>
Date: Wed, 25 May 2011 10:55:08 -0400



> Date: Wed, 25 May 2011 09:49:00 -0400
> From: roman_at_bestroman.com
> To: biomathjdaily_at_gmail.com
> CC: r-help_at_r-project.org
> Subject: Re: [R] Processing large datasets
>
> Thanks Jonathan.
>
> I'm already using RMySQL to load data for a couple of days.
> I wanted to know what the relevant R capabilities are if I want to process much bigger tables.
>
> R always reads the whole set into memory, and this might be a limitation in the case of big tables, correct?

OK, now I'll ask: perhaps for my first R effort I will try to find the source code for data frame and make a paging or streaming derivative. That is, at least for fixed-size records, it could still supply things like the total number of rows, but would have facilities for paging data in and out of memory. Presumably all users of a data frame work through a limited interface, which I guess could be extended with hints such as "prefetch this". I haven't looked at this idea in a while, but the issue keeps coming up; would the dev list be the place for it?

Anyway, for your immediate needs (a few summary statistics) you could probably write a simple C++ program that ultimately becomes part of an R package. It is a good idea to see what is already available, but these questions come up here a lot, and the usual suggestion is "use a DB", which is exactly the opposite of what you want if you have predictable access patterns (although even there, prefetch could probably be implemented).

> Doesn't it use temporary files or something similar to deal with such amounts of data?
>
> As an example I know that SAS handles sas7bdat files up to 1TB on a box with 76GB memory, without noticeable issues.
>
> --Roman
>
> ----- Original Message -----
>
> > In cases where I have to parse through large datasets that will not
> > fit into R's memory, I will grab relevant data using SQL and then
> > analyze said data using R. There are several packages designed to do
> > this, like [1] and [2] below, that allow you to query a database
> > using SQL and end up with that data in an R data.frame.
>
> > [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
> > [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
>
> > On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko wrote:
> > > Hi R list,
> > >
> > > I'm new to R software, so I'd like to ask about its capabilities.
> > > What I'm looking to do is to run some statistical tests on quite
> > > big tables which are aggregated quotes from a market feed.
> > >
> > > This is a typical set of data.
> > > Each day contains millions of records (up to 10 million unfiltered).
> > >
> > > 2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y 1 1 0 0
> > > 2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y 1 0 0 0
> > > 2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y 1 0 0 0
> > >
> > > I'll need to filter it first based on some criteria.
> > > Since I keep it in a MySQL database, this can be done with a query.
> > > Not super efficient; I checked that already.
> > >
> > > Then I need to aggregate the dataset into different time frames
> > > (time is represented in ms from midnight, like 35482391).
> > > Again, this can be done with a database query; I'm not sure which
> > > is going to be faster.
> > > The aggregated tables are going to be much smaller, on the order of
> > > thousands of rows per observation day.
> > >
> > > Then calculate basic statistic: mean, standard deviation, sums etc.
> > > After stats are calculated, I need to perform some statistical
> > > hypothesis tests.
> > >
> > > So, my question is: which tool is faster for data aggregation and
> > > filtering on big datasets: MySQL or R?
> > >
> > > Thanks,
> > > --Roman N.
> > >
> > >
> > > ______________________________________________
> > > R-help_at_r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
>
> > --
> > ===============================================
> > Jon Daily
> > Technician
> > ===============================================
> > #!/usr/bin/env outside
> > # It's great, trust me.
>
>
Received on Wed 25 May 2011 - 14:56:44 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 25 May 2011 - 15:20:10 GMT.
