Re: [R] Tools For Preparing Data For Analysis

From: Stephen Tucker
Date: Sun, 10 Jun 2007 12:27:50 -0700 (PDT)

Since R is supposed to be a complete programming language, I wonder why these tools couldn't be implemented in R (unless speed is the issue). Of course, it's a naive desire to have a single language that does everything, but it seems that R currently has most of the functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this elegantly. My [physical science] data is usually not as horrendously structured, so I can generally get away with a program containing this type of code:

txtin <- scan(filename,what="",sep="\n")
filteredList <- lapply(strsplit(txtin,delimiter),FUN=filterfunction)

   # filterfunction() returns selected (and possibly transformed)
   # elements if present and NULL otherwise;
   # it may include calls to grep(), regexpr(), gsub(), substring(),
   # nchar(), sscanf(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind,filteredList)

   # then match(), subset(), aggregate(), etc.
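
A minimal runnable sketch of the pattern above. The input lines, the delimiter, and the body of filterfunction() are all hypothetical, made up only to show the shape of the approach; stacking the pieces with do.call(rbind, ...) is one common choice, not the only one.

```r
# Hypothetical input: log lines, only some of which carry data.
txtin <- c("# run 1",
           "T=295.1;P=101.3",
           "comment: sensor reset",
           "T=296.4;P=100.9")
delimiter <- ";"

# Hypothetical filter: keep records made of a T= field and a P= field,
# return NULL for everything else.
filterfunction <- function(fields) {
  if (length(fields) != 2 || !all(grepl("^[TP]=", fields))) return(NULL)
  values <- as.numeric(sub("^[TP]=", "", fields))
  data.frame(temperature = values[1], pressure = values[2])
}

filteredList <- lapply(strsplit(txtin, delimiter), FUN = filterfunction)

# Drop the NULL entries explicitly, then stack the one-row data frames.
mydataframe <- do.call(rbind, Filter(Negate(is.null), filteredList))
```

From here the usual tools (match(), subset(), aggregate(), ...) apply to mydataframe.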

If the file is large, I open a file connection instead, read a single line at a time, and apply filterfunction() to each line in a for loop rather than using lapply(). Of course, the devil is in the details of the filtering function, but I believe most of the required text processing facilities are already provided by R.
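
A sketch of the line-at-a-time variant, under some assumptions: the file contents and filterfunction() are hypothetical, readLines(con, n = 1) is used here in place of scan(), and a repeat loop with an end-of-file check stands in for the for loop since the number of lines is not known in advance.

```r
# Hypothetical file, written to a temporary location so the sketch runs as is.
filename <- tempfile(fileext = ".txt")
writeLines(c("id,value", "a,1.5", "skip this line", "b,2.5"), filename)

# Hypothetical filter: keep "id,number" records, return NULL otherwise.
filterfunction <- function(fields) {
  if (length(fields) != 2) return(NULL)
  value <- suppressWarnings(as.numeric(fields[2]))
  if (is.na(value)) return(NULL)
  data.frame(id = fields[1], value = value)
}

con <- file(filename, open = "r")
pieces <- list()
repeat {
  line <- readLines(con, n = 1)        # read one line from the connection
  if (length(line) == 0) break         # nothing left: end of file
  rec <- filterfunction(strsplit(line, ",")[[1]])
  if (!is.null(rec)) pieces[[length(pieces) + 1]] <- rec
}
close(con)

mydataframe <- do.call(rbind, pieces)
```

Only one line is held in memory at a time, at the cost of growing the list of kept records as the loop runs.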

I often have tasks that involve a combination of shell-scripting and text processing to construct the data frame for analysis; I started out using Python+NumPy to do the front-end work but have been using R progressively more (frankly, all of it) to take over that portion since I generally prefer the data structures and methods in R.

> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.
> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.

> You can do the same thing in R with a for loop, but there are better
> ways, e.g.
> subset(df, !duplicated(ID)) and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind", lapply(split(df, df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
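
The first/last-record idioms quoted above can be tried on a small example. The data frame here is hypothetical; duplicated(ID, fromLast = TRUE) is used as an equivalent of the rev(!duplicated(rev(ID))) form, and the ave() call for cumulative doses is an added illustration rather than something from the quoted message.

```r
# Hypothetical data: several visits per subject, sorted by ID and visit.
df <- data.frame(ID    = c(1, 1, 1, 2, 2, 3),
                 visit = c(1, 2, 3, 1, 2, 1),
                 dose  = c(10, 20, 30, 5, 5, 40))

# First and last record per subject via duplicated().
first_obs <- subset(df, !duplicated(ID))
last_obs  <- subset(df, !duplicated(ID, fromLast = TRUE))

# The same thing via split() / lapply() / do.call().
first_obs2 <- do.call("rbind", lapply(split(df, df$ID), head, 1))
last_obs2  <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: cumulative dose per subject.
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
```

The split-based versions take any per-subject function in place of head or tail, which is what makes them generalize.
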
> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
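
One way to read "transposing the problem" is to loop over intervals rather than subjects, collecting for each interval the subjects still under observation. The sketch below does that in base R; the data, the cut points, and all column names are hypothetical, and it only handles the record-splitting step, not the covariates themselves.

```r
# Hypothetical follow-up data: one record per subject.
subjects <- data.frame(ID    = 1:3,
                       time  = c(2.5, 0.8, 4.0),  # follow-up time
                       event = c(1, 0, 1))        # 1 = event, 0 = censored
cuts <- c(0, 1, 3, Inf)                           # hypothetical interval boundaries

# For each interval (lo, hi], find the subjects who contribute and with what.
pieces <- lapply(seq_len(length(cuts) - 1), function(k) {
  lo <- cuts[k]
  hi <- cuts[k + 1]
  at_risk <- subjects[subjects$time > lo, ]
  if (nrow(at_risk) == 0) return(NULL)
  data.frame(ID       = at_risk$ID,
             interval = k,
             start    = lo,
             stop     = pmin(at_risk$time, hi),
             # the event is credited only to the interval in which it occurs
             event    = as.numeric(at_risk$event == 1 & at_risk$time <= hi))
})

long <- do.call(rbind, pieces)
long <- long[order(long$ID, long$interval), ]
```

Each subject ends up with one row per interval they enter, so one input record becomes many.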

> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's, when
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
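
A sketch of the split-then-merge approach for mixed record types. The card layout (a leading "P" or "V" type code, whitespace-separated fields) and all column names are assumptions made up for the example.

```r
# Hypothetical deck: "P" cards hold patient background, "V" cards hold visits.
cards <- c("P 001 F 1950",
           "V 001 1985-03-01 120",
           "V 001 1985-06-01 115",
           "P 002 M 1942",
           "V 002 1985-04-15 140")

# Split on the card type using grep().
patient_lines <- grep("^P", cards, value = TRUE)
visit_lines   <- grep("^V", cards, value = TRUE)

# Read the two kinds separately, each with its own format.
patients <- read.table(text = patient_lines,
                       col.names  = c("card", "id", "sex", "birthyear"),
                       colClasses = c("character", "character",
                                      "character", "integer"))
visits <- read.table(text = visit_lines,
                     col.names  = c("card", "id", "date", "sbp"),
                     colClasses = c("character", "character",
                                    "character", "numeric"))

# Merge them on the patient id, dropping the card-type column.
merged <- merge(patients[, -1], visits[, -1], by = "id")
```

Reading the id as character keeps leading zeros intact, which matters when it is the merge key.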


Received on Sun 10 Jun 2007 - 19:44:25 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.