Re: [R] matching country name tables from different sources

From: McGehee, Robert
Date: Wed 11 Jan 2006 - 06:28:27 EST

I would throw a tolower() around s1 and s2 so that 'canada' matches with 'CANADA', and perhaps consider using a Levenshtein distance rather than the longest common subsequence.

An algorithm for Levenshtein distance can be found here (courtesy of Stephen Upton)
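
As a rough sketch of the suggestion above (not the routine referred to in this message): recent versions of R ship adist() in the utils package, which computes a generalized Levenshtein distance. The example vectors are the ones from Gabor's message below; the variable names d and best are placeholders.

```r
# Sketch only: utils::adist() stands in for the Levenshtein routine
# mentioned above; it is an assumption that your R version provides it.
x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")

# Lower-case both sides so that 'canada' and 'CANADA' compare equal.
d <- adist(tolower(x), tolower(y))
dimnames(d) <- list(x, y)

# For each name in x, pick the name in y with the smallest edit distance.
best <- apply(d, 1, which.min)
data.frame(x = x, match = y[best], distance = d[cbind(seq_along(x), best)])
```

Note that plain edit distance penalizes differences in length, so "US" ends up closer to "Kanada" than to "United States" here. Abbreviations like this will probably still need a small manual lookup table.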


-----Original Message-----
From: Werner Wernersen
Sent: Tuesday, January 10, 2006 2:00 PM
To: Gabor Grothendieck
Subject: Re: [R] matching country name tables from different sources

Thanks for the nice code, Gabor!

Unfortunately, it does not seem to work for my purpose: it confuses many countries when I compare two lists of more than 150 countries each. Do you have any other suggestions?

Gabor Grothendieck wrote:

If they were the same you could use merge. To figure out the correspondence automatically or semi-automatically, try this:

x <- c("Canada", "US", "Mexico")
y <- c("Kanada", "United States", "Mehico")
result <- outer(x, y, function(x,y) mapply(lcs2, x, y))
result[] <- sapply(result, nchar)

# try both which.max and which.min and if you are lucky
# one of them will give unique values and that is the one to use
# In this case which.max does.

apply(result, 1, which.max) # 1 2 3
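
Once the mapping is unique, it can be used to give both tables a common key and then merge them. The following is only a minimal sketch: the data frames t1 and t2, their columns a, b and country_alt, and the values in them are hypothetical, and idx is hard-coded to the 1 2 3 result shown above.

```r
# Hypothetical example tables; column names and values are placeholders.
t1 <- data.frame(country = c("Canada", "US", "Mexico"), a = 1:3,
                 stringsAsFactors = FALSE)
t2 <- data.frame(country_alt = c("Kanada", "United States", "Mehico"), b = 4:6,
                 stringsAsFactors = FALSE)

# Index of the best match in t2 for each row of t1, i.e. the result of
# apply(result, 1, which.max) above.
idx <- c(1, 2, 3)

# Only trust the mapping if no row of t2 was picked twice.
stopifnot(anyDuplicated(idx) == 0)

# Give t2 the spelling used in t1 as a common key, then merge on it.
t2$country <- NA_character_
t2$country[idx] <- t1$country
merged <- merge(t1, t2, by = "country")
merged
```

With 150 or more countries the uniqueness check will often fail, so it is probably worth inspecting the duplicated entries by hand before merging.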

# Calculate the longest common subsequence of two strings.
# Note: lcs2 must be defined before running the outer() call above.
lcs2 <- function(s1, s2) {

     longest <- function(x, y) if (nchar(x) > nchar(y)) x else y
     # Make sure args are strings
     a <- as.character(s1); an <- nchar(a) + 1
     b <- as.character(s2); bn <- nchar(b) + 1

     # If either arg is an empty string, the common subsequence is empty
     if (nchar(a) == 0 || nchar(b) == 0) return("")

     # Initialize matrix for calculations; m[i,j] holds the LCS of the
     # first i-1 characters of a and the first j-1 characters of b
     m <- matrix("", nrow = an, ncol = bn)

     for (i in 2:an)
          for (j in 2:bn)
               m[i,j] <- if (substr(a,i-1,i-1) == substr(b,j-1,j-1))
                    paste(m[i-1,j-1], substr(a,i-1,i-1), sep = "")
               else
                    longest(m[i-1,j], m[i,j-1])

     # Returns the longest common subsequence as a string
     m[an, bn]
}


On 1/10/06, Werner Wernersen wrote:
> Hi,
> Before I reinvent the wheel, I wanted to kindly ask for your opinion
> on whether there is a simple way to do it.
> I want to merge a large number of tables from different data sources
> in R, and the matching criterion is the country name. The tables are
> of different sizes and sometimes the country names differ slightly.
> Has anyone done this, or does anyone have a recommendation on which
> commands I should look at to automate this task as much as possible?
> Thanks a lot for your effort in advance.
> All the best,
> Werner
> ______________________________________________
> mailing list
> PLEASE do read the posting guide!

Received on Wed Jan 11 06:43:43 2006
