Re: [R] Tools For Preparing Data For Analysis

From: Ted Harding <>
Date: Fri, 08 Jun 2007 10:43:14 +0100 (BST)

On 08-Jun-07 08:27:21, Christophe Pallier wrote:
> Hi,
> Can you provide examples of data formats that are problematic
> to read and clean with R ?
> The only problematic cases I have encountered were cases with
> multiline and/or varying length records (optional information).
> Then, it is sometimes a good idea to preprocess the data to
> present in a tabular format (one record per line).
> For this purpose, I use awk (e.g.
> which is very adept at processing ascii data files (awk is
> much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything which has to do with basic checking for the kind of messes that people can get data into when they "put it on the computer", I think awk is ideal. It is very flexible (far more so than many, even long-time, awk users suspect), very transparent in its programming language (as opposed to, say, perl), fast, and light on system resources (a rare delight in these days, when upgrading your software may require upgrading your hardware).
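As a small sketch of the sort of basic check I mean (the data here are invented for illustration): flag any line in a whitespace-separated stream whose field count differs from the first line's.

```shell
# Report lines whose field count disagrees with line 1 (sample data invented).
printf 'a b c\nd e\nf g h\n' |
awk 'NR == 1 { n = NF }
     NF != n { printf "line %d: %d fields (expected %d)\n", NR, NF, n }'
```

On the sample input this prints "line 2: 2 fields (expected 3)" and nothing else.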

Although it may seem on the surface that awk is "two-dimensional" in its view of data (line by line, and field by field within a line), it has flexible internal data structures (associative arrays) and recursive function capability, which allow a lot more to be done with the data once they have been read in.

For example, I've used awk to trace ancestry through a genealogy, given a data file where each line includes the identifier of an individual and the identifiers of its male and female parents (where known). And that was for pedigree dogs, where what happens in real life makes Oedipus look trivial.
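A toy version of that pedigree idea, with invented dog names and an assumed "id sire dam" format ("-" for unknown): a recursive awk function walks the ancestry of one individual.

```shell
# Build sire/dam lookup tables, then recursively print all known
# ancestors of "rex" (names and file format are invented for this sketch).
printf 'rex bruno fifi\nbruno max lola\nfifi max -\n' |
awk '
  { sire[$1] = $2; dam[$1] = $3 }
  function anc(id) {
    if (sire[id] != "" && sire[id] != "-") { print sire[id]; anc(sire[id]) }
    if (dam[id]  != "" && dam[id]  != "-") { print dam[id];  anc(dam[id])  }
  }
  END { anc("rex") }
'
```

On the sample data this lists bruno, max, lola, fifi, max; note that "max" appears twice, which is exactly the kind of inbreeding loop the real pedigree data were full of.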

> I have never encountered a data file in ascii format that I
> could not reformat with Awk. With binary formats, it is
> another story...

But then it is a good idea to process the binary file with an instance of the software that created it, to produce an ASCII file (say in CSV format).

> But, again, this is my limited experience; I would like to
> know if there are situations where using SAS/SPSS is really
> a better approach.

The main thing often useful for data cleaning that awk lacks is any associated graphics. It is -- by design -- a line-by-line text-file processor. While, for instance, you could use awk to accumulate numerical histogram counts, you would have to use something else to display the histogram. And for scatter-plots there is probably not much point in bringing awk into the picture at all (unless a preliminary filtering of the mess is needed anyway).
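The histogram-count part, at least, is a one-liner (sample numbers and the bin width of 10 are arbitrary choices for this sketch):

```shell
# Count values from column 1 into bins of width 10; sort -n fixes the
# (unspecified) order in which awk iterates over the array.
printf '3\n12\n15\n27\n' |
awk '{ bin = int($1/10)*10; n[bin]++ }
     END { for (b in n) printf "%d-%d: %d\n", b, b+9, n[b] }' | sort -n
```

The counts would then go to R, gnuplot, or whatever you like for the actual picture.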

That being said, though, awk can still be useful for extracting data fields from a file for submission to other software.
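For instance (column names and data invented), pulling two columns out of a CSV file so another program only sees what it needs:

```shell
# Skip the header line and emit just the name and score columns
# from a comma-separated stream (sample data invented).
printf 'id,name,score\n1,ann,7\n2,bob,9\n' |
awk -F, 'NR > 1 { print $2 "," $3 }'
```

which prints "ann,7" and "bob,9", ready to be piped into something else.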

Another area where awk would not have much to offer is where, as part of your preliminary data inspection, you want to inspect the results of some standard statistical analyses.

As a final comment, utilities like awk can be used far more fruitfully on operating systems (the unixoid family) which incorporate at ground level the infrastructure for "plumbing" together streams of data output from different programs.
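That "plumbing" is just the shell pipe: awk cleans a stream which the next program consumes, with nothing written to disk in between. A trivial sketch (input invented): strip comment and blank lines, then let wc count the surviving records.

```shell
# awk as one stage in a pipeline: drop comments and blank lines,
# pass the rest downstream to wc (sample input invented).
printf '# header\n1 2\n\n3 4\n' |
awk '!/^#/ && NF > 0' |
wc -l
```

which reports 2 records; any program reading standard input could stand in for wc here.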


E-Mail: (Ted Harding) <> Fax-to-email: +44 (0)870 094 0861
Date: 08-Jun-07                                       Time: 10:43:05
------------------------------ XFMail ------------------------------

Received on Fri 08 Jun 2007 - 09:55:03 GMT.

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 08 Jun 2007 - 11:31:31 GMT.
