Re: [R] Re : Large database help

From: Rogerio Porto <>
Date: Thu 18 May 2006 - 09:17:29 EST

Thank you all for the discussion.

I'll try to summarize the suggestions and give some partial conclusions for the sake of completeness of this thread.

First, I had read the I/O manual but had forgotten the function read.fwf, as suggested by Roger Peng. I'm sorry. But, according to the manual, this function is not recommended for large files, so I need to find out how to read fixed-width-format files using the scan function, since there is no such example in that manual or in ?scan. At a glance, it seems that read.fwf inserts a separator between the fields of each line so that the result can then be read as an ordinary delimited file.
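A minimal sketch of reading a fixed-width file in chunks, cutting each line at known character positions. It uses readLines() on an open connection rather than scan(), and the field widths (5, 3 and 8), the demonstration file and the function name read_fixed_chunks are assumptions made only for illustration:

```r
# Hypothetical layout: three fields of widths 5, 3 and 8 characters.
widths <- c(5, 3, 8)
ends   <- cumsum(widths)
starts <- ends - widths + 1

# Small demonstration file standing in for the real, large one.
path <- tempfile(fileext = ".txt")
writeLines(c("00001 12    3.50",
             "00002 34   10.25",
             "00003 56    7.00"), path)

# Illustrative helper (not part of base R): read the file chunk by chunk.
read_fixed_chunks <- function(path, starts, ends, chunk_size = 2) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    # An open connection keeps its position, so each call continues
    # where the previous one stopped.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Cut every line at the same character positions, one column per field.
    fields <- lapply(seq_along(starts), function(j) {
      substring(lines, starts[j], ends[j])
    })
    names(fields) <- paste0("V", seq_along(starts))
    pieces[[length(pieces) + 1]] <- as.data.frame(fields,
                                                  stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)
}

dat <- read_fixed_chunks(path, starts, ends)
dat$V2 <- as.numeric(dat$V2)  # surrounding blanks are ignored on conversion
dat$V3 <- as.numeric(dat$V3)
```

For a file that really does not fit in memory, each chunk would be summarized or filtered inside the loop instead of being kept and bound together at the end.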

I've also read the I/O manual, mainly chapter 4 about using Relational Databases.
This suggestion was made by Uwe Ligges and Justin Bem, who advocated the use of MySQL with the RMySQL package. I'm still installing MySQL to try to convert my fixed-width-format file to that database but, from the I/O manual, it seems I can only calculate five descriptive statistics (aggregate functions). So I couldn't calculate medians or more advanced statistics like a cluster analysis.
Robert Citek raised this point, and thus I'm not sure that working with MySQL will help to solve my problem. However, RMySQL has a dbApply function that applies R functions to groups (chunks) of database rows.

Roger Peng suggested subsetting the file. Almost all participants in this thread noted that a lot of RAM is needed, even to work with just a few variables as Prof. Brian Ripley suggested.
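A small sketch of keeping only a few variables while reading, assuming a hypothetical layout of three fields with widths 5, 3 and 8. In read.fwf, a negative width marks a field to be skipped, so the skipped column never reaches the resulting data frame. The file contents and column names are made up for illustration:

```r
# Demonstration file with three fixed-width fields (widths 5, 3, 8).
path <- tempfile(fileext = ".txt")
writeLines(c("00001 12    3.50",
             "00002 34   10.25",
             "00003 56    7.00"), path)

# Keep the first and third fields; -3 skips the 3-character middle field.
few <- read.fwf(path,
                widths     = c(5, -3, 8),
                col.names  = c("id", "value"),
                colClasses = c("character", "numeric"))
```

This reduces the size of the result, not the cost of scanning the file, so it does not remove the manual's caveat about read.fwf and large files.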

The future looks promising through a collection of *big* packages specially designed to handle big data files on almost any hardware and OS configuration, although they can be time-demanding in some cases. It seems the first one in this collection is the biglm package by Thomas Lumley, cited by Greg Snow. The obvious drawback is that one has to rewrite every package that can't handle big data files or, at least, its most memory-demanding operations. This last point could be implemented through an option like big.file=TRUE added to some functions. This point of view is one of *scaling up* the methods.
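To illustrate what *scaling up* a method can mean, here is a rough sketch of a linear regression fitted chunk by chunk, accumulating the cross-products X'X and X'y so that only one chunk is in memory at a time. This is not the biglm implementation, only the simplest version of the idea; the simulated data and the chunking by chunk_id are assumptions standing in for chunks read from a large file:

```r
set.seed(1)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n, sd = 0.1)

# Stand-in for reading 10 successive chunks from disk.
chunk_id <- rep(1:10, each = n / 10)

xtx <- matrix(0, 3, 3)  # running sum of X'X
xty <- matrix(0, 3, 1)  # running sum of X'y
for (k in unique(chunk_id)) {
  chunk <- dat[chunk_id == k, ]
  X <- cbind(1, chunk$x1, chunk$x2)   # intercept plus two predictors
  xtx <- xtx + crossprod(X)           # adds t(X) %*% X for this chunk
  xty <- xty + crossprod(X, chunk$y)  # adds t(X) %*% y for this chunk
}

# Solve the normal equations once all chunks have been seen.
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data, possible here only because it is small.
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))
```

Accumulating cross-products is easy to read but numerically less stable than updating a QR decomposition, so for real work a purpose-built package is the safer choice.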

Another promising way is to *scale down* the dataset. Statisticians are aware of these techniques from non-hierarchical cluster analysis and principal component analysis, among others (mainly sampling). Engineers and signal processing people know them from data compression techniques. Computer scientists work with training sets and data mining, which use methods to scale down datasets. An example was given by Richard M. Heiberger, who cited a paper by William DuMouchel et al. on Squashing Flat Files. Maybe there could be some R functions specialized in these methods that, using a DBMS, could retrieve significant data (records and variables) that could be handled by R.
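Sampling is the simplest way to *scale down*. A sketch of drawing a uniform random sample of k lines from a file of unknown length in a single pass (reservoir sampling), so that the whole file is never held in memory. The function name reservoir_sample and the demonstration file are assumptions for illustration, and this is not the squashing method from the cited paper:

```r
# Illustrative helper (not part of base R): one-pass uniform sample of k lines.
reservoir_sample <- function(path, k, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  sample_lines <- character(0)
  seen <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    for (line in lines) {
      seen <- seen + 1
      if (seen <= k) {
        # Fill the reservoir with the first k lines.
        sample_lines[seen] <- line
      } else {
        # Keep the new line with probability k / seen,
        # replacing a uniformly chosen slot.
        j <- sample.int(seen, 1)
        if (j <= k) sample_lines[j] <- line
      }
    }
  }
  sample_lines
}

# Demonstration file with 5000 distinct records.
all_lines <- sprintf("%05d", 1:5000)
path <- tempfile(fileext = ".txt")
writeLines(all_lines, path)

set.seed(42)
s <- reservoir_sample(path, k = 100)
```

The sampled lines can then be parsed with the usual fixed-width tools and analysed in memory.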

That's all for now!

Rogerio.

Received on Thu May 18 09:21:00 2006
