Re: [R] how to group a large list of strings into categories based on string similarity?

From: G FANG <fanggangsw_at_gmail.com>
Date: Sat, 26 Jun 2010 11:42:30 -0700

Hi Martin,

Thanks a lot for your advice.

I tried the process you suggested as below, it worked, but in a different way that I planned.

library(Biostrings)
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",

      "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
      "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
      "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
      "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
      "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
      "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
names(x) <- seq_along(x)
dna <- DNAStringSet(x)
while (!all(width(dna) == width(dna <- trimLRPatterns("N", "N", dna)))) {} names(dna)[order(dna)[rank(dna, ties.method="min")]]

The output is,
"1" "2" "3" "4" "4" "6" "4", this is the right answer after trimining N's, i.e. without considering N, which strings are the same.

But actually, the match I planned is position-to-position match, i.e. 1st and 2nd strings are the same except for the N's

So, the expected output is 1 1 2 2 3 2 4

Please advice.

Thanks!

--gang

On Wed, Jun 23, 2010 at 7:55 PM, Martin Morgan <mtmorgan_at_fhcrc.org> wrote:

> On 06/23/2010 07:46 PM, Martin Morgan wrote:
>> On 06/23/2010 06:55 PM, G FANG wrote:

>>> Hi,
>>>
>>> I want to group a large list (20 million) of strings into categories
>>> based on string similarity?
>>>
>>> The specific problem is: given a list of DNA sequence as below
>>>
>>> ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
>>> ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
>>> CAGGATCATGCTGCGCGCGAACGGCGGGAGT
>>> CAGGATCATGCTGCGCGCGAANNNNNNNNNN
>>> CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
>>> ......
>>> .....
>>> NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
>>> NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
>>> NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
>>> NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
>>> NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
>>> NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
>>>
>>> 'N' is the missing letter
>>>
>>> It can be seen that some strings are the same except for those N's
>>> (i.e. N can match with any base)
>>>
>>> given this list of string, I want to have
>>>
>>> 1) a vector corresponding to each row (string), for each string assign
>>> an id, such that similar strings (those only differ at N's) have the
>>> same id
>>> 2) also get a mapping list from unique strings ('unique' in term of
>>> the same similarity defined above) to the ids
>>>
>>> I am a matlab user shifting to R. Please advice on efficient ways to do this.
>>
>> The Bioconductor Biostrings package has many tools for this sort of
>> operation. See http://bioconductor.org/packages/release/Software.html
>>
>> Maybe a one-time install
>>
>>    source('http://bioconductor.org/biocLite.R')
>>    biocLite('Biostrings')
>>
>> then
>>
>>   library(Biostrings)
>>   x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
>>         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
>>         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
>>         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
>>         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
>>         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
>>         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
>>   names(x) <- seq_along(x)
>>   dna <- DNAStringSet(x)
>>   while (!all(width(dna) ==
>>               width(dna <- trimLRPatterns("N", "N", dna)))) {}
>>   names(dna)[rank(dna)]
>
> oops, maybe closer to
>
>   names(dna)[order(dna)[rank(dna, ties.method="min")]]
>
>> although there might be a faster way (e.g., match 8, 4, 2, 1 N's). Also,
>> your sequences likely come from a fasta file (Biostrings::readFASTA) or
>> a text file with a column of sequences (ShortRead::readXStringColumns)
>> or from alignment software (ShortRead::readAligned /
>> ShortRead::readFastq). If you go this route you'll want to address
>> questions to the Bioconductor mailing list
>>
>>   http://bioconductor.org/docs/mailList.html
>>
>> Martin
>>

>>> Thanks!
>>>
>>> Gang
>>>
>>> ______________________________________________
>>> R-help_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 26 Jun 2010 - 18:46:33 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 27 Jun 2010 - 15:10:35 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive