[R] Approximate name matching

From: Stavros Macrakis <macrakis_at_alum.mit.edu>
Date: Mon, 09 May 2011 14:30:17 -0400

Is there R software available for doing approximate matching of personal names?

I have data about the same people produced by different organizations, and the only matching key I have is the name. I know that commercial solutions exist, and I know I could code this from scratch, but I'd prefer to build on an existing free solution if one exists.

Unfortunately, the names are not standardized, and there is also a certain level of error:

       Danny Williams (nickname)
       Dan Williams (nickname)
       Daniel Williams (formal name)
       Dan William (spelling error)
       D. Williams (initials)
       Daniel "Danny" Williams (formal + nickname)
       Dan P. Williams (includes middle initial)
       Williams, Daniel (different convention)
       William Daniel (wrong order or missing comma + misspelling)
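
Some of the purely conventional variation above (comma order, quoted nicknames, middle initials, case) could be removed before any fuzzy comparison. The following is only a minimal sketch in base R under that assumption; `normalize_name` is a hypothetical helper, not an existing function, and it does nothing about nicknames, misspellings, or swapped name order.

```r
# Hypothetical helper: reduce a raw name string to a lower-case "first last" key.
normalize_name <- function(x) {
  x <- tolower(trimws(x))
  # "williams, daniel" -> "daniel williams"
  x <- sub("^([^,]+),[[:space:]]*(.+)$", "\\2 \\1", x)
  # drop quoted nicknames such as "danny"
  x <- gsub("\"[^\"]*\"", " ", x)
  # drop periods after initials, then collapse runs of whitespace
  x <- gsub(".", "", x, fixed = TRUE)
  x <- gsub("[[:space:]]+", " ", trimws(x))
  tokens <- strsplit(x, " ", fixed = TRUE)
  vapply(tokens, function(tk) {
    n <- length(tk)
    if (n > 2) {
      # keep first and last token; drop single-letter middle tokens (initials)
      middle <- tk[-c(1, n)]
      tk <- c(tk[1], middle[nchar(middle) > 1], tk[n])
    }
    paste(tk, collapse = " ")
  }, character(1))
}

normalize_name("Williams, Daniel")           # "daniel williams"
normalize_name("Daniel \"Danny\" Williams")  # "daniel williams"
normalize_name("Dan P. Williams")            # "dan williams"
```

A case like "William Daniel" passes through unchanged apart from case, so it would still need an approximate comparison afterwards.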

Is there any R software available to find likely matches, ideally with some estimate of accuracy of match? Levenshtein distance as implemented in agrep is a useful solution for some of these cases; I was wondering if there is something that covers more cases.
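
To make the Levenshtein baseline concrete, here is a minimal sketch that assumes a version of R whose utils package provides `adist`. `best_match` is a hypothetical helper, and the similarity it returns (one minus the edit distance divided by the longer string's length) is only a crude stand-in for an estimate of match accuracy, not a calibrated probability.

```r
# Hypothetical helper: find the candidate closest to `query` by edit distance.
best_match <- function(query, candidates) {
  # generalized Levenshtein distance from query to every candidate
  d <- as.vector(adist(query, candidates, ignore.case = TRUE))
  # scale to [0, 1]; 1 means identical strings (ignoring case)
  sim <- 1 - d / pmax(nchar(query), nchar(candidates))
  i <- which.max(sim)
  data.frame(query = query,
             match = candidates[i],
             distance = d[i],
             similarity = sim[i],
             stringsAsFactors = FALSE)
}

candidates <- c("Daniel Williams", "Dan Williams", "Mary Jones")
best_match("Dan William", candidates)
```

Here "Dan William" is one edit away from "Dan Williams", so that candidate is chosen. Nickname pairs such as Danny/Daniel, or a swapped name order, can still score poorly under this measure.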

For this particular application, I am not concerned with issues such as variant latinizations/transliterations (e.g. Tsung-Dao Lee ~ T.D. Lee ~ Li Zhengdao; Ghaddafi ~ Qaddhaffi), but of course if someone handles that as well....


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 09 May 2011 - 18:34:04 GMT