Re: [R] Tools For Preparing Data For Analysis

From: Peter Dalgaard
Date: Sun, 10 Jun 2007 12:25:35 +0200

Douglas Bates wrote:
> Frank Harrell indicated that it is possible to do a lot of difficult
> data transformation within R itself if you try hard enough but that
> sometimes means working against the S language and its "whole object"
> view to accomplish what you want and it can require knowledge of
> subtle aspects of the S language.
Actually, I think Frank's point was subtly different: It is *because* of the differences in view that it sometimes seems difficult to find the way to do something in R that is apparently straightforward in SAS. I.e. the solutions exist and are often elegant, but may require some lateral thinking.

Case in point: Finding the first or the last observation for each subject when there are multiple records for each subject. The SAS way would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that you can compare the subject ID with the one from the previous record, working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better ways, e.g.
subset(df, !duplicated(ID)) and subset(df, rev(!duplicated(rev(ID)))), or maybe do.call("rbind", lapply(split(df, df$ID), head, 1)), resp. tail. Or something involving aggregate(). (The latter approaches generalize better to other within-subject functionals like cumulative doses, etc.)
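
To make the first/last-record idioms concrete, here is a minimal sketch on a made-up data frame. The column names visit and dose and all the values are assumptions for illustration only; the fromLast argument of duplicated() and the use of ave() for a cumulative dose are additions that are not part of the message itself.

```r
# Hypothetical data: several records per subject, already sorted by ID and visit
df <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 3),
  visit = c(1, 2, 3, 1, 2, 1),
  dose  = c(10, 20, 30, 5, 5, 40)
)

# First record per subject: duplicated() is FALSE at the first occurrence of an ID
first1 <- subset(df, !duplicated(ID))

# Last record per subject: reverse, flag first occurrences, reverse back
last1 <- subset(df, rev(!duplicated(rev(ID))))

# Same result using the fromLast argument available in current versions of R
last1b <- subset(df, !duplicated(ID, fromLast = TRUE))

# split/lapply form: cut the data frame by subject, take head or tail, restack
first2 <- do.call("rbind", lapply(split(df, df$ID), head, 1))
last2  <- do.call("rbind", lapply(split(df, df$ID), tail, 1))

# A within-subject functional: cumulative dose per subject, in record order
df$cumdose <- ave(df$dose, df$ID, FUN = cumsum)
```

The split() form is slower on large data but takes any per-subject function in place of head or tail, which is why it generalizes better.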

The hardest cases that I know of are the ones where you need to turn one record into many, such as occurs in survival analysis with time-dependent, piecewise constant covariates. This may require "transposing the problem", i.e. for each interval you find out which subjects contribute and with what, whereas the SAS way would be a within-subject loop over intervals containing an OUTPUT statement.
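
A rough sketch of what "transposing the problem" could look like, assuming a made-up data frame surv with columns ID, time and event, and made-up cut points in cuts. None of these names or values come from the message. The loop runs over intervals rather than subjects, and each pass collects the subjects still under observation at the start of that interval.

```r
# Hypothetical data: one record per subject (follow-up time, event indicator)
surv <- data.frame(ID = 1:3, time = c(2.5, 0.8, 4.0), event = c(1, 0, 1))

# Hypothetical cut points giving the intervals (0,1], (1,3], (3,Inf)
cuts <- c(0, 1, 3, Inf)

# For each interval, find the contributing subjects and what they contribute
pieces <- lapply(seq_len(length(cuts) - 1), function(k) {
  lo <- cuts[k]
  hi <- cuts[k + 1]
  at_risk <- surv[surv$time > lo, ]          # still followed when interval starts
  data.frame(
    ID       = at_risk$ID,
    interval = rep(k, nrow(at_risk)),
    start    = rep(lo, nrow(at_risk)),
    stop     = pmin(at_risk$time, hi),       # follow-up ends inside or at interval end
    event    = as.numeric(at_risk$event == 1 & at_risk$time <= hi)
  )
})

# Stack the per-interval pieces and sort back into subject order
long <- do.call("rbind", pieces)
long <- long[order(long$ID, long$interval), ]
```

Each subject ends up with one record per interval entered, and the (stop - start) pieces add up to the original follow-up time. A time-dependent covariate could then be attached per interval, e.g. with merge() on ID and interval.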

Also, there are some really weird data formats, where e.g. the input format is different in different records. Back in the 80's, when punched-card input was still common, it was quite popular to have one card with background information on a patient plus several cards detailing visits, and you'd get a stack of cards containing both kinds. In R you would most likely split on the card type using grep() and then read the two kinds separately and merge() them later.

Received on Sun 10 Jun 2007 - 10:33:21 GMT
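
The mixed card deck described in the message might be handled along the following lines. This is only a sketch under assumptions: the card layout (a leading "P" or "V" marking the card type, whitespace-separated fields) and all column names and values are invented for illustration. Real punched-card data would more likely be fixed-width.

```r
# Hypothetical card deck: "P" cards hold patient background, "V" cards hold visits
cards <- c(
  "P 101 F 54",
  "V 101 1 120",
  "V 101 2 115",
  "P 102 M 61",
  "V 102 1 140"
)

# Split on the card type with grep()
patient_idx <- grep("^P", cards)
visit_idx   <- grep("^V", cards)

# Read each kind separately; colClasses keeps "F" from being read as FALSE
patients <- read.table(text = cards[patient_idx],
                       col.names  = c("type", "ID", "sex", "age"),
                       colClasses = c("character", "integer", "character", "integer"))
visits <- read.table(text = cards[visit_idx],
                     col.names  = c("type", "ID", "visit", "sbp"),
                     colClasses = c("character", "integer", "integer", "numeric"))

# Drop the card-type column and merge background onto each visit record
both <- merge(patients[, -1], visits[, -1], by = "ID")
```

The result has one row per visit with the patient's background repeated on each. For fixed-width cards, read.fwf() could replace read.table() in the two reading steps.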

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 15 Jun 2007 - 01:34:01 GMT.
