Re: [R] Can R replicate this data manipulation in SAS?

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Fri, 22 Apr 2011 11:27:11 -0400

Since this involves time series within each id group I thought it would be interesting to see if this could be formulated using zoo series. The approach is to read it in, and convert it to a long form by just stacking the start and stop times in a data frame and converting that to zoo using the concatenation of date and id as the time index. It makes use of the fact that

- zoo can use non-standard "times" (here the concatenation of date and id)
- read.zoo can split and aggregate and
- the resulting object can be directly cumsum'd using cumsum.zoo .

Finally we convert it back to the required form.

This was mostly done by looking at the output rather than trying to follow the SAS code so some differences are possible. It does seem to give the same output in this case.

Yet another approach might be to use model.matrix to create the NRTI/NNRTI/PI columns:

        model.matrix(~ drug - 1, long2) * long2$change but below we stick with zoo.

Some of the code here could be transplanted into Ista's solution (which uses the reshape and plyr packages) to get a combination approach.

Lines <- "1004 NRTI 07/24/95 01/05/99

1004 NRTI  11/20/95 12/10/95
1004 NRTI  01/10/96 01/05/99
1004 PI    05/09/96 11/16/97
1004 NRTI  06/01/96 02/01/97
1004 NRTI  07/01/96 03/01/97
9999 PI    01/02/03 .
9999 NNRTI 04/05/06 07/08/09"

library(zoo)

# need na.fill from development version of zoo
source("
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo")

# read in data and reshape to long form

DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".",

        col.names = c("id", "drug", "date", "date"), check.names = FALSE) long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1))

# convert to zoo. date.id, the concatenation of date & id, becomes index.
# Then take cumulative sums of columns and append regimen.

long2 <- with(long,

        data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, change)) z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = sum)

z <- cumsum(na.fill(z, fill = 0))
z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along)

# convert z to data frame for final processing

id <- sub(".* ", "", time(z))
start_date = as.Date(sub(" .*", "", time(z))) d2 <- data.frame(id, start_date, coredata(z))

# from each group of rows in same id form the desired columns.
# f processes one such group.

f <- function(x) with(x, data.frame(

	id = head(id, -1),
	start_date = head(start_date, -1),
	stop_date = start_date[-1],
	head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI,
		  HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) |
			(NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) |
            (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1)))
result <- do.call(rbind, by(d2, id, f))
rownames(result) <- NULL

result

The end result is this:

> result

     id start_date  stop_date regimen NRTI NNRTI PI all HAART
1  1004 1995-07-24 1995-11-20       1    1     0  0   1     0
2  1004 1995-11-20 1995-12-10       2    2     0  0   2     0
3  1004 1995-12-10 1996-01-10       3    1     0  0   1     0
4  1004 1996-01-10 1996-05-09       4    2     0  0   2     0
5  1004 1996-05-09 1996-06-01       5    2     0  1   3     1
6  1004 1996-06-01 1996-07-01       6    3     0  1   4     1
7  1004 1996-07-01 1997-02-01       7    4     0  1   5     1
8  1004 1997-02-01 1997-03-01       8    3     0  1   4     1
9  1004 1997-03-01 1997-11-16       9    2     0  1   3     1
10 1004 1997-11-16 1999-01-05      10    2     0  0   2     0
11 9999 2003-01-02 2006-04-05       1    0     0  1   1     0
12 9999 2006-04-05 2009-07-08       2    0     1  1   2     0
13 9999 2009-07-08       <NA>       3    0     0  1   1     0




-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Fri 22 Apr 2011 - 15:32:50 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 22 Apr 2011 - 17:10:32 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive