Re: [R] Tools For Preparing Data For Analysis

From: Robert Wilkins <>
Date: Thu, 14 Jun 2007 19:35:18 -0500

[ Arrggh, not reply, but reply to all; cross my fingers again. Sorry, Peter! ]


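I don't think you need a RETAIN statement. With a BY patientID statement in the data step, either

if first.patientID ;

(to keep the first record per patient) or

if last.patientID ;

(to keep the last one) ought to do it.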

I must admit it's actually better than the Vilno version, being a bit more concise. The Vilno version is:

if ( not firstrow(patientID) ) deleterow ;

Ah well.

For the folks asking for the location of the software (I know I posted it, but it didn't connect to the thread, and you get a huge number of posts each day, sorry):

Vilno, find at

DAP & PSPP, find at

Awk, find at lots of places,

Anything else? DAP & PSPP are hard to find, I'm sure there's more out there! What about MDX? Nahh, not really the right problem domain. Nobody uses MDX for this stuff.

If my examples, using clinical trial data, are boring and hard to understand for those who asked for examples (and presumably don't work in clinical trials), let me know. Some of these other examples I'm reading about are quite interesting. It doesn't help that clinical trial databases cannot be public. Making a fake database would take a lot of time. The irony is, even with my deep understanding of data preparation in clinical trials, the pharmas still don't want to give me a job (because I was gone for many years).

Let's see if this post works: thanks to the folks who gave me advice on how to properly respond to a post within a thread. (Although the thread in my gmail account is only a subset of the posts visible in the archives.) Crossing my fingers ....

On 6/10/07, Peter Dalgaard <> wrote:
> Douglas Bates wrote:
> > Frank Harrell indicated that it is possible to do a lot of difficult
> > data transformation within R itself if you try hard enough but that
> > sometimes means working against the S language and its "whole object"
> > view to accomplish what you want and it can require knowledge of
> > subtle aspects of the S language.
> >
> Actually, I think Frank's point was subtly different: It is *because* of
> the differences in view that it sometimes seems difficult to find the
> way to do something in R that is apparently straightforward in SAS.
> I.e. the solutions exist and are often elegant, but may require some
> lateral thinking.


> Case in point: Finding the first or the last observation for each
> subject when there are multiple records for each subject. The SAS way
> would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that
> you can compare the subject ID with the one from the previous record,
> working with data that are sorted appropriately.

> You can do the same thing in R with a for loop, but there are better
> ways, e.g.
> subset(df, !duplicated(ID)) and subset(df, rev(!duplicated(rev(ID)))), or
> maybe
> do.call("rbind", lapply(split(df, df$ID), head, 1)), resp. tail. Or
> something involving aggregate(). (The latter approaches generalize
> better to other within-subject functionals like cumulative doses, etc.).
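
For anyone who wants to try the approaches Peter lists, here is a minimal sketch in R. The data frame, its column names (ID, visit, dose) and the values are made up for illustration; the use of ave() for the cumulative dose is my assumption about one way to do it, not something from Peter's post.

```r
# Hypothetical example data: several visits per patient, sorted by ID and visit.
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3),
  visit = c(1, 2, 3, 1, 2, 1),
  dose  = c(10, 20, 30, 5, 5, 40)
)

# First record per subject: duplicated() flags every repeat of an ID.
first_rows <- df[!duplicated(df$ID), ]

# Last record per subject: the same idea, scanning from the end.
last_rows <- df[!duplicated(df$ID, fromLast = TRUE), ]

# split/lapply/rbind version, which generalizes to other per-subject summaries.
last_rows2 <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: cumulative dose per subject via ave().
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
```

Like the SAS version, the first/last logic here relies on the data already being sorted within each subject.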

> The hardest cases that I know of are the ones where you need to turn one
> record into many, such as occurs in survival analysis with
> time-dependent, piecewise constant covariates. This may require
> "transposing the problem", i.e. for each interval you find out which
> subjects contribute and with what, whereas the SAS way would be a
> within-subject loop over intervals containing an OUTPUT statement.
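
A rough sketch of what "transposing the problem" might look like in R, looping over intervals instead of subjects. The subject data, the cut points and all column names are hypothetical, and this is only one possible reading of Peter's description.

```r
# Hypothetical data: one row per subject with follow-up time and event indicator.
subj <- data.frame(ID = 1:3, time = c(2.5, 0.8, 4.0), event = c(1, 0, 1))

# Assumed interval boundaries for the piecewise-constant covariate.
cuts <- c(0, 1, 3, Inf)

# Loop over intervals, not subjects: for each interval, keep the subjects
# still under observation at its start and clip their time to the interval.
pieces <- lapply(seq_len(length(cuts) - 1), function(k) {
  lo <- cuts[k]
  hi <- cuts[k + 1]
  at_risk <- subj[subj$time > lo, ]
  data.frame(
    ID       = at_risk$ID,
    interval = rep(k, nrow(at_risk)),
    start    = rep(lo, nrow(at_risk)),
    stop     = pmin(at_risk$time, hi),
    # The event belongs only to the interval in which follow-up ends.
    event    = at_risk$event * (at_risk$time <= hi)
  )
})

# Stack the per-interval pieces and put each subject's rows together.
long <- do.call("rbind", pieces)
long <- long[order(long$ID, long$start), ]
```

Each subject ends up with one row per interval they contribute to, and the interval lengths add back up to the original follow-up time.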

> Also, there are some really weird data formats, where e.g. the input
> format is different in different records. Back in the 80's, when
> punched-card input was still common, it was quite popular to have one
> card with background information on a patient plus several cards
> detailing visits, and you'd get a stack of cards containing both kinds.
> In R you would most likely split on the card type using grep() and then
> read the two kinds separately and merge() them later.
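
The grep()-then-merge() idea can be sketched like this. The "card" layout (a leading P or V, then space-separated fields) and every column name are assumptions made up for the example; a real deck would more likely be fixed-width.

```r
# Hypothetical "card deck": type P = patient background (ID, sex, age),
# type V = visit (ID, visit number, measurement).
cards <- c(
  "P 101 F 54",
  "V 101 1 7.2",
  "V 101 2 6.9",
  "P 102 M 61",
  "V 102 1 8.1"
)

# Split the lines by card type using the leading letter.
p_lines <- grep("^P", cards, value = TRUE)
v_lines <- grep("^V", cards, value = TRUE)

# Read each kind with its own column layout; colClasses keeps "F" from
# being read as a logical FALSE.
patients <- read.table(
  text = p_lines,
  col.names = c("type", "ID", "sex", "age"),
  colClasses = c("character", "integer", "character", "integer")
)
visits <- read.table(
  text = v_lines,
  col.names = c("type", "ID", "visit", "value"),
  colClasses = c("character", "integer", "integer", "numeric")
)

# Drop the card-type column and merge background data onto each visit.
merged <- merge(patients[, -1], visits[, -1], by = "ID")
```

The result has one row per visit, with the patient's background fields repeated on each.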

Received on Fri 15 Jun 2007 - 00:44:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 15 Jun 2007 - 02:37:36 GMT.