[R] Matching on multiple columns

From: Farrel Buchinsky <fjbuch_at_gmail.com>
Date: Thu 11 Jan 2007 - 20:56:52 GMT


Am I correct in believing that one cannot match on multiple columns? One can indeed subset on multiple criteria from different variables (or columns) but not from unique combinations thereof. I need to exclude about 10000 rows from 108000 rows of data based on several unique combinations of identifiers in two columns. Only merge() seems to be able to do that. Merge would allow me to positively select but it would not allow me to deselect (or exclude). Look at how I got around the problem.
It is inelegant. Have a missed a more direct function?

x<-seq(from=1,to=10)
y <-rep(1:2,5)
data<-data.frame(cbind(x,y))
x<-seq(from=1,to=10)
y <-rep(1:5,2)
data1<-data.frame(cbind(x,y))
newdata<-rbind(data,data1)
newdata[10,2] <- 3
exclude<-newdata[18:20,]
#This is a simulation of real life problem
#We now have two dataframes. Newdata is the data set from an
experiment. In real #life it has an additional 9 columns of data
#exclude is a dataframe that was manually created after discovering
some quality #control issues
#Any row in newdata that matches any row in exclude must be discarded

match(newdata$x,exclude$x)# useless because it is only on one column
match(newdata$y,exclude$y)# useless because it is only on one column
newdata$x %in%  exclude$x  # useless because it is only on one column
newdata$y %in%  exclude$y # useless because it is only on one column
newdata$x %in%  exclude$x &  newdata$y %in%  exclude$y  # useless,

#eventhough it is using both columns, because it is not
#using them in a synchronous manner. Row 10 in new data should not
have been #marked "TRUE"
#It was only labeled such because the 10 in the x column is indeed in
the exclude x #column and the 3 in the y column is indeed in the exclude y column but not #together
which(newdata$x %in% exclude$x & newdata$y %in% exclude$y)#also gets it #wrong
match(newdata,exclude)# intuitively this could have worked but alas match can #only handle vectors and not dataframes. It cannot match on multiple columns
#I have to stoop to the inelegant maneuver of creating a combined
variable of the #two columns, albeit only temporarily paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")
#or one could do this


match(paste(newdata$x,newdata$y,sep=":"),paste(exclude$x,exclude$y,sep=":"))
#or

{which(paste(newdata$x,newdata$y,sep=":") %in% paste (exclude$x,exclude$y,sep=":"))}
#Or if you want to use the result in an index term or a selection
argument in a subset command
{paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE}
#As in

{newdata[paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE,]}
#or

{subset(newdata,paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":")==FALSE)}
-- 
Farrel Buchinsky

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Fri Jan 12 08:05:34 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 11 Jan 2007 - 23:30:27 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.