[R] how to group a large list of strings into categories based on string similarity?

From: G FANG <fanggangsw_at_gmail.com>
Date: Wed, 23 Jun 2010 18:55:59 -0700


Hi,

I want to group a large list (20 million) of strings into categories based on string similarity.

The specific problem is: given a list of DNA sequences such as the one below

ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
CAGGATCATGCTGCGCGCGAACGGCGGGAGT
CAGGATCATGCTGCGCGCGAANNNNNNNNNN
CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
......
.....
NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
NNNNNNNNNNTTCGCGCGCAGCATGATCCTG

('N' is the missing letter.)

It can be seen that some strings are the same except for the N's (i.e. N can match any base).
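
The matching rule described above can be sketched as a small predicate. This is only an illustration, not a tested solution for 20 million strings: the function name `compatible` is hypothetical, and it assumes all sequences have the same length, as in the sample.

```r
# Hypothetical helper (not from the original post): TRUE when two
# equal-length sequences agree at every position where neither has an 'N'.
compatible <- function(a, b) {
  x <- strsplit(a, "")[[1]]   # split into single characters
  y <- strsplit(b, "")[[1]]
  if (length(x) != length(y)) return(FALSE)
  all(x == y | x == "N" | y == "N")
}

compatible("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
           "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN")   # TRUE
compatible("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
           "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN")   # FALSE
```

One caveat: this relation is not transitive in general. For example, "ANT" is compatible with both "ACT" and "AGT", which are not compatible with each other, so a single id per string is only well defined when such ambiguous cases do not occur in the data.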

Given this list of strings, I want to have

  1. a vector with one id per row (string), such that similar strings (those that differ only at N's) have the same id
  2. a mapping from unique strings ('unique' in terms of the similarity defined above) to the ids

I am a MATLAB user shifting to R. Please advise on efficient ways to do this.
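
As a rough sketch of the two outputs listed above, one possible approach is to treat the N-free sequences as group references and match every other sequence against them with a regular expression in which N becomes a wildcard. This rests on assumptions not stated in the post: that every group contains at least one N-free sequence, and that each masked sequence matches only one reference. The variable names (`seqs`, `refs`, `ids`, `mapping`) are illustrative.

```r
# A few of the sample sequences from the post
seqs <- c(
  "ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
  "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
  "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
  "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
  "NNNNNNNCCGTTCGCGCGCAGCATGATCCTG",
  "NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT"
)

# Assumption: each group has an N-free member, used as its reference
refs <- unique(seqs[!grepl("N", seqs, fixed = TRUE)])

# Work on distinct strings only, then turn N into the regex wildcard "."
uniq <- unique(seqs)
patterns <- paste0("^", gsub("N", ".", uniq, fixed = TRUE), "$")

# For each distinct string: index of the first matching reference, or NA
uniq_id <- vapply(patterns, function(p) {
  hit <- grep(p, refs)
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

ids <- uniq_id[match(seqs, uniq)]          # output 1: one id per row
mapping <- data.frame(id = seq_along(refs), # output 2: id -> reference string
                      sequence = refs,
                      stringsAsFactors = FALSE)
```

Here `ids` comes out as 1, 1, 2, 2, 1, 2 for the six sample rows. Deduplicating with `unique()` first keeps the regex work proportional to the number of distinct strings, but the loop still compares each pattern against all references, so for 20 million rows with many references a faster indexing scheme would probably be needed.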

Thanks!

Gang



Received on Thu 24 Jun 2010 - 01:57:36 GMT
