Re: [R] analyze amino acid sequence (composition)of proteins

From: Jean lobry <>
Date: Tue 20 Jun 2006 - 04:53:33 EST

>i am a newer to R and i am doing some protein category classification based on
>the amino acid sequence.while i have some questions urgently.

I'm not sure to understand exactly what you are trying to do, but if this is a kind of clustering from the amino-acid composition of proteins beware that the genomic G+C content is a major confounding factor.

>1. any packages for analysis amino acid sequence

There are some utilities in the seqinR package to retrieve protein sequences from various protein databases such as SwissProt. The list of available databases (including nucleic databases) is given by:


  [1] "genbank"      "embl"         "emblwgs"      "swissprot"    "ensembl"
  [6] "ensemblnew"   "emglib"       "nrsub"        "nbrf"         "hobacnucl"
[11] "hobacprot"    "hovernucl"    "hoverprot"    "hogennucl"    "hogenprot"
[16] "hoverclnu"    "hoverclpr"    "homolensprot" "homolensnucl" "HAMAPnucl"
[21] "HAMAPprot"    "hoppsigen"    "nurebnucl"    "nurebprot"    "taxobacgen"
[26] "greview"      "hogendnucl"   "hogendprot"   "refseq"

More info with the infobank argument set to TRUE:

   choosebank(infobank = TRUE)[1:5,]

        bank status
1 genbank on GenBank Rel. 154 (15 June 2006) Last Updated: Jun 18, 2006

2      embl     on                                   EMBL Library 
Release 86 (March 2006)
3   emblwgs     on           EMBL Whole Genome Shotgun sequences 
Release 86  (March 2006)
4 swissprot     on UniProt Rel. 7 (SWISS-PROT 49 + TrEMBL 32): Last 
Updated: Jun  6, 2006
5   ensembl     on 

Ensembl databases rel 24

There are also some tools to compute protein indices such as the Kyte and Doolittle hydrophaty index, or the PI (Isoelectric point of a protein). See the examples in this document:

Most protein indices are linear forms on amino acid relative frequencies whose coefficients are available in the AAindex (

>2. given two sequences "AAA" and "BBB",how can i combine them into "AAABBB"

paste() is your friend:

   paste("AAA", "BBB", sep = "")
[1] "AAABBB" Or define your own infix binary operator for convenience:

   "%+%" <- function(...) paste(..., sep = "")    "AAA" %+% "BBB" %+% "CCC"
>3. based on "AAABBB",how can i get some statistics of this string such as how
>many letters,how many "A"s in the string.


3 3

With a real protein sequence:

  prot <- read.fasta(File = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")

If you want the three letter code instead:

   table(prot) -> tmp
   names(tmp) <- aaa(names(tmp))
Stp Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr

   1 8 6 6 18 6 8 1 9 14 29 5 7 10 9 13 16 7 6 3 1

Try also:


Hope this helps,

Jean R. Lobry            (
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo  : +33 472 43 12 87     fax    : +33 472 43 13 88

______________________________________________ mailing list
PLEASE do read the posting guide!
Received on Tue Jun 20 05:04:17 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 20 Jun 2006 - 06:12:17 EST.

Mailing list information is available at Please read the posting guide before posting to the list.