From: Chris Wallace <c.wallace_at_qmul.ac.uk>

Date: Tue 02 Aug 2005 - 00:24:27 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue Aug 02 00:41:03 2005


I am struggling to migrate some Stata code to R. I have a data
frame that sometimes contains repeated observations (rows) for the
same family. I want to keep only one observation per family, choosing
that observation according to another variable. An example data
frame is:

# construct example data
fam <- c(1, 2, 3, 3, 4, 4, 4)
wt <- c(1, 1, 0.6, 0.4, 0.4, 0.4, 0.2)
keep <- c(1, 1, 1, 0, 1, 0, 0)
dat <- data.frame(fam, wt, keep)
dat

Within each family I want to keep the observation for which wt is at its maximum. Where the maximum is tied and so does not identify a unique observation, I want to keep just one of the tied rows, and I do not care which. The rows to keep are indicated above by keep==1. (Note that keep <- c(1,1,1,0,0,1,0) would be fine too, but not c(1,1,1,0,0,0,1).)

The Stata code I would use is:

bys fam (wt): keep if _n==_N

This is my (long-winded) attempt in R:

# first keep those rows where wt = max_fam(wt)
maxwt <- by(dat, dat$fam, function(x) max(x[, 2]))
maxwt <- sapply(maxwt, "[[", 1)
maxwt.dat <- data.frame("maxwt" = maxwt, "fam" = as.integer(names(maxwt)))
dat <- merge(dat, maxwt.dat)
dat <- dat[dat$wt == dat$maxwt, ]
dat

Now I am stuck. Two rows with fam==4 remain and I want to keep either one of them. I have tried combinations of sample with apply or by, but with no success. The only solution I can find is an inefficient for loop:

# identify those families with >1 remaining row
more <- by(dat, dat$fam, function(x) dim(x)[1])
more <- sapply(more, "[[", 1)
more.dat <- data.frame("more" = more, "fam" = as.integer(names(more)))
dat <- merge(dat, more.dat)

# sample one row from each family for which more > 1
result <- dat[dat$more == 1, ]
for (f in unique(dat$fam[dat$more > 1])) {
  # row names (as character) of the tied rows in family f
  rows <- rownames(dat[dat$fam == f, ])
  result <- rbind(result, dat[sample(rows, 1), ])
}
result

I am sure that, for something so simple in Stata to be so complicated in R, I must be missing something about R, but searching the help files and using RSiteSearch has not led to any better solution.

Any suggestions would be most helpful! Thanks, C.
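
For reference, one possible shorter approach is sketched below. It is only an illustration under the assumptions stated in the message (columns named fam and wt, and any one of the tied rows being acceptable), not a definitive answer. The first variant mimics the Stata idiom by sorting on fam and then wt and keeping the last row of each family. The second picks the row index of the largest wt within each family directly. The object names sorted, result, idx and result2 are introduced here for illustration only.

```r
# Example data as constructed in the message above (keep column omitted)
fam <- c(1, 2, 3, 3, 4, 4, 4)
wt  <- c(1, 1, 0.6, 0.4, 0.4, 0.4, 0.2)
dat <- data.frame(fam, wt)

# Variant 1: sort by family, then by weight within family (ascending),
# which is roughly what "bys fam (wt)" does in Stata
sorted <- dat[order(dat$fam, dat$wt), ]

# Keep the last row of each family, i.e. a row with the largest wt;
# among tied rows, whichever one sorts last is kept
result <- sorted[!duplicated(sorted$fam, fromLast = TRUE), ]
result

# Variant 2: for each family, find the row index with the largest wt;
# which.max() returns the first maximum, so ties go to the earliest row
idx <- tapply(seq_len(nrow(dat)), dat$fam,
              function(i) i[which.max(dat$wt[i])])
result2 <- dat[idx, ]
result2
```

Both variants resolve ties deterministically. If a random choice among tied rows is wanted, as in the for loop above, shuffling the rows first (for example with dat[sample(nrow(dat)), ]) before applying either variant should have that effect.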

