[R] Tools For Preparing Data For Analysis

From: Robert Wilkins <irishhacker_at_gmail.com>
Date: Thu, 07 Jun 2007 19:01:41 -0500

As noted on the R-project web site itself ( www.r-project.org -> Manuals -> R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing. The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that ( when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead" ). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.

SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is you don't have thousands of spare dollars.
Using Java for data cleaning is an exercise in sado-masochism, Java has a learning curve (almost) as difficult as C++.

There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has excellent regular expression facilities.

However, for some types of complex demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventers don't realize this, I've learned it the hard way.

But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, well you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers.

  1. DAP This is an open source implementation of SAS. The founder: Susan Bassein Find it at: directory.fsf.org/math/stats (GNU GPL)
  2. PSPP This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are, it reflects the fact that he has only started doing the statistical tests. The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept. Also at : directory.fsf.org/math/stats (GNU GPL)
  3. Vilno This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and also transposes data and calculates averages and such. (No t-tests or regressions in this version). I created this, during the years 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 or so test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. The maintenance of Vilno has slowed down, because I am currently
    (desparately) looking for employment. But once I've found new
    employment and living quarters and settled in, I will continue to enhance Vilno in my spare time. The founder: that would be me, Robert Wilkins Find it at: code.google.com/p/vilno ( GNU GPL )
    ( In particular, the tarball at code.google.com/p/vilno/downloads/list
    , since I have yet to figure out how to use Subversion ).
  4. Who knows? It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software ( regression , etc ) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: data cleaning , data munging , data crunching , or just getting the data ready for analysis.

R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 08 Jun 2007 - 00:10:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 08 Jun 2007 - 13:31:47 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.