DECIPHER - Align Translation

This short example describes how to use DECIPHER to align a set of protein coding DNA sequences, as described in:

ES Wright (2015) "DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment." BMC Bioinformatics, doi:10.1186/s12859-015-0749-z.

For an in-depth tutorial on sequence alignment, see the vignette "The Art of Multiple Sequence Alignment in R", available from the Documentation page.

How do I align protein coding sequences?

First, install DECIPHER and load the library in R. Next, set the "fas" variable to the path of the FASTA file containing the unaligned sequences (e.g., "~/mySeqs.fas").

The sequences are translated before alignment, which requires knowing the reading frame of each sequence. By default, DECIPHER guesses the reading frame of each sequence, but it can also be specified (see below). When DNA output is requested, the newly aligned amino acid (AA) sequences are "reverse translated" by inserting gaps into the original sequences. With type="AAStringSet", as in the call below, the aligned AA sequences are returned instead.
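
The reverse translation step can be illustrated with a minimal base R sketch. The helper reverse_translate() is hypothetical and not part of DECIPHER; it assumes the DNA sequence is already in frame 1 with a length divisible by three, and it simply places a three-character gap ("---") wherever the aligned AA sequence has a gap. DECIPHER's own implementation may differ in its details.

```r
# Hypothetical helper (not a DECIPHER function): thread the gaps of an
# aligned amino acid sequence back into the original in-frame DNA sequence.
reverse_translate <- function(aligned_aa, dna) {
  # assumption: the DNA starts in frame 1 and contains whole codons only
  stopifnot(nchar(dna) %% 3 == 0)
  residues <- strsplit(aligned_aa, "")[[1]]
  n <- nchar(dna) %/% 3
  starts <- seq(1, by = 3, length.out = n)
  codons <- substring(dna, starts, starts + 2)
  # each non-gap residue must correspond to exactly one codon
  stopifnot(sum(residues != "-") == length(codons))
  out <- character(length(residues))
  out[residues == "-"] <- "---"   # one AA gap becomes three nucleotide gaps
  out[residues != "-"] <- codons  # residues keep their original codons
  paste(out, collapse = "")
}

# toy example with made-up sequences
gapped <- reverse_translate("MA-KRK", "ATGGCTAAAAGAAAA")
gapped
```

Because gaps are only ever inserted in multiples of three, the codon structure of the original sequence is preserved in the gapped DNA.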

> # load the DECIPHER library in R
> library(DECIPHER)
> 
> # specify the path to the FASTA file (in quotes)
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> 
> # load the sequences from the file
> seqs <- readDNAStringSet(fas)
> 
> # look at some of the sequences (optional)
> seqs
  A DNAStringSet instance of length 317
      width seq                   names               
  [1]   819 ATGGCTTTA...AAAAGAAAA 1
  [2]   822 ATGGGAATA...AGGAAAAAG 2
  [3]   822 ATGGGAATA...AGGAAAAAG 3
  [4]   822 ATGGGAATA...AGGAAAAAG 4
  [5]   819 ATGGCTATC...CGTGGTAAA 5
  ...   ... ...
[313]   819 ATGGCAATT...CGTACTAAA 313
[314]   822 ATGCCTATT...CGCGTCAAG 314
[315]   864 ATGGGCATT...CGTCAGTCT 315
[316]   831 ATGGCACTG...CGGAAGAAG 316
[317]   840 ATGGGCATT...GGGCGAGGT 317
> 
> # for help, see the AlignTranslation help page (optional)
> ?AlignTranslation
> 
> # perform the alignment via the translations
> # change NA to 1, 2 or 3 if the readingFrame is known
> aligned <- AlignTranslation(seqs,
+    readingFrame=NA,
+    type="AAStringSet") # return AA or DNA sequences?
Determining distance matrix based on shared 4-mers:
  |============================================| 100%


Time difference of 1.88 secs


Clustering into groups by similarity:
  |============================================| 100%


Time difference of 0.62 secs


Aligning Sequences:
  |============================================| 100%


Time difference of 4.63 secs


Determining distance matrix based on alignment:
  |============================================| 100%


Time difference of 0.34 secs


Reclustering into groups by similarity:
  |============================================| 100%


Time difference of 0.44 secs


Realigning Sequences:
  |============================================| 100%


Time difference of 5.16 secs


Refining the alignment:
  |============================================| 100%


Time difference of 0.01 secs


> 
> # view the alignment in a browser (optional)
> BrowseSeqs(aligned, highlight=0)
> 
> # write the alignment to a new FASTA file
> writeXStringSet(aligned,
+    file="<<REPLACE WITH PATH TO OUTPUT FASTA FILE>>")
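
As a variation on the call above, the following sketch requests gapped DNA rather than AA sequences and fixes the reading frame instead of letting DECIPHER guess it. The two toy sequences are made up for illustration and are assumed to be in frame 1. The verbose argument is assumed to be passed through to AlignSeqs to suppress the progress output.

```r
library(DECIPHER)

# toy in-frame coding sequences (made up for illustration)
dna <- DNAStringSet(c(seq1 = "ATGGCTTTAAAAAGAAAA",
                      seq2 = "ATGGCTAAAAGAAAA"))

# translate in frame 1, align the translations,
# and return the reverse-translated (gapped) DNA
aligned_dna <- AlignTranslation(dna,
   readingFrame = 1,
   type = "DNAStringSet",
   verbose = FALSE)

# all aligned sequences should share a single width
same_width <- length(unique(width(aligned_dna))) == 1

# removing the gaps should recover the input sequences
recovered <- all(as.character(RemoveGaps(aligned_dna)) == as.character(dna))
```

If the reading frame differs between sequences, readingFrame can instead be given as a vector with one value (1, 2, 3 or NA) per sequence.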