Re: [R] deduplication

From: Allan Engelhardt <>
Date: Thu, 03 Jun 2010 17:33:01 +0100

Maybe something like the following will get you started:

g <-, directed=FALSE)
neighborhood(g, +Inf)

There is perhaps a more efficient way, but I hope this helps a little.
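
As a rough illustration of the graph idea, here is a minimal sketch that treats each pair as an undirected edge and reads off the connected components. It assumes the igraph package is installed and reuses the id1/id2 vectors from the question quoted below. The functions graph_from_data_frame() and components() are current igraph names and are my choice for this sketch (it uses components() rather than neighborhood()); the variable names comp and groups are only illustrative.

```r
library(igraph)

# Pairs of record IDs judged to belong to the same individual
# (copied from the question quoted below).
id1 <- c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
id2 <- c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
id  <- data.frame(id1, id2)

# Each row becomes an undirected edge; vertex names are the record IDs.
g <- graph_from_data_frame(id, directed = FALSE)

# Connected components: records linked directly or through a chain of pairs.
comp <- components(g)

# One vector of record IDs per individual, sorted for readability.
groups <- split(as.integer(names(comp$membership)), comp$membership)
groups <- lapply(groups, sort)
print(groups)
```

On the example data this should give two groups: records 1, 3, 4, 5, 6, 7, 8, 12, 17 and 18 in one, and 9, 10, 15 and 16 in the other. Records that never appear in a pair are not in the graph and would need to be added back as singletons.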


On 03/06/10 14:14, Epi-schnier wrote:
> Colleagues,
> I am trying to de-duplicate a large (long) database (approx 1mil records) of
> diagnostic tests. Individuals in the database can have up to 25
> observations, but most will have only one. IDs for de-duplication (names,
> sex, lab number...) are patchy. In a first step, I am using Andreas Borg's
> excellent record linkage package (), that leaves me with a list of 'pairs'
> looking very much like this:
> id1<-c(4,17,9,1,1,1,3,3,6,15,1,1,1,1,3,3,3,3,4,4,4,5,5,12,9,9,10,10)
> id2<-c(8,18,10,3,6,7,6,7,7,16,4,5,12,18,4,5,12,18,5,12,18,12,18,18,15,16,15,16)
> id<-data.frame(cbind(id1,id2))
> where a pair means that the records belong to the same individual (e.g.,
> record 4 and record 8; 17 and 18...). My problem now is to get a list with
> all records that belong to the same person (in the example, observations
> 1,3,4,5,6,7,8,12, 17 and 18 are all from the same person). The problem is to
> find the link between 1 and 8 (only through 1 and 4 and 4 and 8) and the
> link between 1 and 17 (through 18). I can do it in my head, but I am missing
> the code that would work its way through too many records.
> Any clever ideas?
> (using R 2.10.1 on Windows XP)
> Thanks,
> Christian
Received on Thu 03 Jun 2010 - 16:37:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Jun 2010 - 17:10:29 GMT.