Re: [R] Tools For Preparing Data For Analysis

From: Douglas Bates <>
Date: Fri, 08 Jun 2007 07:47:00 -0500

On 6/7/07, Robert Wilkins <> wrote:
> As noted on the R-project web site itself ( ->
> Manuals -> R Data Import/Export ), it can be cumbersome to prepare
> messy and dirty data for analysis with the R tool itself. I've also
> seen at least one S programming book (one of the yellow Springer ones)
> that says, more briefly, the same thing.
> The R Data Import/Export page recommends examples using SAS, Perl,
> Python, and Java. It takes a bit of courage to say that ( when you go
> to a corporate software web site, you'll never see a page saying "This
> is the type of problem that our product is not the best at, here's
> what we suggest instead" ). I'd like to provide a few more
> suggestions, especially for volunteers who are willing to evaluate new
> candidates.
> SAS is fine if you're not paying for the license out of your own
> pocket. But maybe one reason you're using R is you don't have
> thousands of spare dollars.
> Using Java for data cleaning is an exercise in sado-masochism, Java
> has a learning curve (almost) as difficult as C++.
> There are different types of data transformation, and for some data
> preparation problems an all-purpose programming language is a good
> choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has
> excellent regular expression facilities.
> However, for some types of complex demanding data preparation
> problems, an all-purpose programming language is a poor choice. For
> example: cleaning up and preparing clinical lab data and adverse event
> data - you could do it in Perl, but it would take way, way too much
> time. A specialized programming language is needed. And since data
> transformation is quite different from data query, SQL is not the
> ideal solution either.
> There are only three statistical programming languages that are
> well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more
> popular than S for data cleaning.
> If you're an R user with difficult data preparation problems, frankly
> you are out of luck, because the products I'm about to mention are
> new, unknown, and therefore regarded as immature. And while the
> founders of these products would be very happy if you kicked the
> tires, most people don't like to look at brand new products. Most
> innovators and inventors don't realize this, I've learned it the hard
> way.
> But if you are a volunteer who likes to help out by evaluating,
> comparing, and reporting upon new candidates, well you could certainly
> help out R users and the developers of the products by kicking the
> tires of these products. And there is a huge need for such volunteers.
> 1. DAP
> This is an open source implementation of SAS.
> The founder: Susan Bassein
> Find it at: (GNU GPL)
> 2. PSPP
> This is an open source implementation of SPSS.
> The relatively early version number might not give a good idea of how
> mature the data transformation features are, it reflects the fact that
> he has only started doing the statistical tests.
> The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept.
> Also at : (GNU GPL)
> 3. Vilno
> This uses a programming language similar to SPSS and SAS, but quite unlike S.
> Essentially, it's a substitute for the SAS datastep, and also
> transposes data and calculates averages and such. (No t-tests or
> regressions in this version). I created this, during the years
> 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
> my opinion. The tarball includes about 100 or so test cases used for
> debugging - for logical calculation errors, but not for extremely high
> volumes of data.
> The maintenance of Vilno has slowed down, because I am currently
> (desperately) looking for employment. But once I've found new
> employment and living quarters and settled in, I will continue to
> enhance Vilno in my spare time.
> The founder: that would be me, Robert Wilkins
> Find it at: ( GNU GPL )
> ( In particular, the tarball at
> , since I have yet to figure out how to use Subversion ).
> 4. Who knows?
> It was not easy to find out about the existence of DAP and PSPP. So
> who knows what else is out there. However, I think you'll find a lot
> more statistics software ( regression , etc ) out there, and not so
> much data transformation software. Not many people work on data
> preparation software. In fact, the category is so obscure that there
> isn't one agreed term: data cleaning , data munging , data crunching ,
> or just getting the data ready for analysis.

Thanks for bringing up this topic. I think there is definitely a place for such languages, which I would regard as data-filtering languages, but I also think that trying to reproduce the facilities in SAS or SPSS for data analysis is redundant.

Other responses in this thread have mentioned 'little language' filters like awk, which is fine for those who were raised in the Bell Labs tradition of programming ("why type three characters when two-character names should suffice for anything one wants to do on a PDP-11"), but the typical field scientist finds this a bit too terse to understand and would rather write a filter as a paragraph of code that they have a chance of reading and understanding a week later.

Frank Harrell indicated that it is possible to do a lot of difficult data transformation within R itself if you try hard enough, but that sometimes means working against the S language and its "whole object" view to accomplish what you want, and it can require knowledge of subtle aspects of the S language.

General scripting languages like Perl, Python and Ruby can certainly be used for data filtering, but that means learning the language and its idiosyncrasies, and those idiosyncrasies are often exactly the aspects that would be used to write a filter tersely. Readability suffers. ("Hell is reading someone else's Perl code - purgatory is reading your own Perl code.") The very generality of the languages means there is a lot to learn and understand before you can write something like a simple filter.

So I do agree that it would be useful to have a language like the SAS data step (but Open Source, of course) in which to write a data filter. I have one suggestion to make - use the R data frame structure in the form of a .rda file as the binary output format for a data table. That way the user can get the best of both worlds by using a language like Vilno to manipulate and rearrange huge data files, then switching to R for the graphics and data analysis. As a further enhancement one might provide the ability to take a .rda file that contains a single data frame and select columns or rows, including a random sample of the rows, as a filter.
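
As a rough illustration of the column/row filter suggested above, here is a minimal Python sketch. It is only a stand-in: it works on delimited text rather than on a .rda file (reading or writing .rda outside R needs extra tooling), and the function name filter_table and all of its parameters are hypothetical, not part of R, Vilno, or any existing tool.

```python
import csv
import io
import random


def filter_table(text, columns=None, keep_row=None, sample_size=None, seed=None):
    """Select columns and rows from delimited text, optionally sampling rows.

    Hypothetical helper for illustration only; not an existing API.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = list(columns) if columns is not None else list(reader.fieldnames)

    # Row selection: keep only rows for which the predicate is true.
    rows = [row for row in reader if keep_row is None or keep_row(row)]

    # Optional simple random sample of the surviving rows, without replacement.
    if sample_size is not None and sample_size < len(rows):
        rng = random.Random(seed)
        # Sample indices, then sort them so the output keeps the input order.
        picked = sorted(rng.sample(range(len(rows)), sample_size))
        rows = [rows[i] for i in picked]

    # Column selection: project each row onto the requested columns.
    return [{name: row[name] for name in columns} for row in rows]


data = "id,site,value\n1,A,3.2\n2,B,4.1\n3,A,5.0\n4,A,2.7\n"
subset = filter_table(
    data,
    columns=["id", "value"],
    keep_row=lambda row: row["site"] == "A",
    sample_size=2,
    seed=1,
)
```

Here subset holds two of the three site "A" rows, reduced to the id and value columns. A real tool would presumably stream rows instead of holding them all in memory; the list is used here only to keep the sketch short.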

Producing an R data frame may involve passing over the data twice: once to determine the size of the resulting structure, and a second time to evaluate the data itself. This would have been a horrific penalty in the days when SAS and SPSS were developed, but not now.

Received on Fri 08 Jun 2007 - 12:53:46 GMT
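
The two passes over the data described in the message's final paragraph might look something like the following Python sketch. It assumes a delimited text input in which every column is numeric, and it fills plain in-memory arrays rather than writing the actual .rda format; the name read_two_pass is hypothetical.

```python
import csv
import io
from array import array


def read_two_pass(text):
    """Build column-oriented storage from delimited text in two passes.

    Illustrative only: assumes all columns are numeric and does not
    produce a real .rda file.
    """
    # Pass 1: read the header and count the data rows to learn the size.
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n_rows = sum(1 for _ in reader)

    # Allocate each column once, at its final length.
    columns = {name: array("d", [0.0]) * n_rows for name in header}

    # Pass 2: re-read the input and fill the preallocated columns in place.
    # (With a file on disk this would be a seek back to the start.)
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header
    for i, row in enumerate(reader):
        for name, field in zip(header, row):
            columns[name][i] = float(field)
    return n_rows, columns


n_rows, table = read_two_pass("x,y\n1,2.5\n2,3.5\n3,4.5\n")
```

The first pass costs one extra read of the input, but it lets every column be allocated exactly once instead of being grown repeatedly as rows arrive.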

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 10 Jun 2007 - 11:31:49 GMT.
