DECIPHER - Align Synteny

Align Synteny

This short example describes how to use DECIPHER to align genomes. For an in-depth tutorial on sequence alignment, see "The Art of Multiple Sequence Alignment in R", available from the Documentation page.

How do I align genomes?

First install DECIPHER and load the library in R. Next, it is necessary to build a sequence database using DECIPHER. A sequence database is simply another type of file, but it contains all of the genomes organized in such a way that they can be efficiently accessed by their unique name ('identifier').

Individual genomes can be composed of one sequence or many thousands of sequences. Here it is assumed that each genome is in its own FASTA file when initializing the database, although there are other ways to build a sequence database. For an alternative example, where the genomes are downloaded from NCBI GenBank directly, see the Examples Gallery.

Once the sequence database is constructed, the process of mapping syntenic regions and aligning them only requires two lines of code. Plotting of the synteny map is also described here, but is an optional step. Note that there are many more options available, as described in the help documentation.

Hide output

# load the DECIPHER library in R
> library(DECIPHER)
> 
> # specify the path to each FASTA file (in quotes)
> # each genome must be given a unique identifier here
> # for example: Genome1, Genome2, etc.
> fas <- c(Genome1="<<REPLACE WITH PATH TO FASTA FILE 1>>",
+    Genome2="<<REPLACE WITH PATH TO FASTA FILE 2>>",
+    Genome3="<<REPLACE WITH PATH TO FASTA FILE 3>>")
> 
> # specify where to create the new sequence database
> db <- "<<REPLACE WITH PATH TO SEQUENCE DATABASE>>"
> 
> # load the sequences from the file in a loop
> for (i in seq_along(fas)) {
+    Seqs2DB(fas[i], "FASTA", db, names(fas[i]))
+ }
Reading FASTA file from line  1 to 1e+05


Added 1 new sequence to table DNA.
1 total sequence in table DNA.
Time difference of 1 secs


Reading FASTA file from line  1 to 1e+05


Added 1 new sequence to table DNA.
2 total sequences in table DNA.
Time difference of 1 secs


Reading FASTA file from line  1 to 1e+05


Added 1 new sequence to table DNA.
3 total sequences in table DNA.
Time difference of 1 secs


> 
> # map the syntenic regions between each genome pair
> synteny <- FindSynteny(db)
  |============================================| 100%


Time difference of 26.3 secs


> 
> # print a summary of the syntenic map (optional)
> synteny
          Genome1   Genome2  Genome3
Genome1     1 seq  49% hits 62% hits
Genome2 39 blocks     1 seq 50% hits
Genome3 19 blocks 25 blocks    1 seq
> 
> # view the syntenic regions (optional)
> pairs(synteny) # displays a dot plot of all pairs
> plot(synteny) # displays a bar plot of adjacent pairs
> 
> # perform alignments of all pairs of genomes
> DNA <- AlignSynteny(synteny, db)
  |============================================| 100%


Time difference of 62.18 secs


> 
> # view the aligned syntenic blocks for each pair
> unlist(DNA[[1]]) # Genomes 1 and 2
  A DNAStringSet instance of length 78
      width seq                   names               
 [1] 345520 AATTTCGAA...AACTGCATC 1.Genome1
 [2] 345520 AATTTCGAA...AACTGCCTC 1.Genome2
 [3] 332175 TATTTCTTT...AGTTATGAT 2.Genome1
 [4] 332175 TATTTCTTC...AGCTATGAT 2.Genome2
 [5] 152776 AAGTACACA...GTAGCGGAT 3.Genome1
 ...    ... ...
[74]   2570 ACTGTCACA...GATAAAAAA 37.Genome2
[75]   5479 TACTGCGTT...TAATAACAA 38.Genome1
[76]   5479 TAGTAGGTA...TCAAAATAA 38.Genome2
[77]   1219 AGCAAAAAA...TGCAAAAAA 39.Genome1
[78]   1219 AGAAAAAAA...TGCAAAAAA 39.Genome2
> unlist(DNA[[2]]) # Genomes 1 and 3
  A DNAStringSet instance of length 38
      width seq                   names               
 [1] 420181 AAGTACACA...TATAGCAAA 1.Genome1
 [2] 420181 AGGTACACA...TGTAGCAAA 1.Genome3
 [3] 413750 ATCATTATT...TTTTCACAT 2.Genome1
 [4] 413750 ATAATTATT...TTTTTACAT 2.Genome3
 [5] 282526 ATGCCCTCG...TATTTTAAA 3.Genome1
 ...    ... ...
[34]  42624 TCACTATAA...ATATAAAAC 17.Genome3
[35]  25576 GGACGTAAA...AAAAAAACA 18.Genome1
[36]  25576 GGACGCAAA...AAAAAATCA 18.Genome3
[37]  15118 TTAATTAAG...TACTGAAAT 19.Genome1
[38]  15118 TTAAATAAG...TACTAAAAT 19.Genome3
> unlist(DNA[[3]]) # Genomes 2 and 3
  A DNAStringSet instance of length 50
      width seq                   names               
 [1] 676200 TTTTCATTT...TTAATAGAA 1.Genome2
 [2] 676200 TTTTTATTT...TTAAGAAAA 1.Genome3
 [3] 422433 TTGTCCATT...AACTGCCTC 2.Genome2
 [4] 422433 TCGTTTTTT...AACTGCATC 2.Genome3
 [5] 238313 ACTGTCACA...TTAAATGCT 3.Genome2
 ...    ... ...
[46]  10690 TAGAAAATA...AGAAAATGG 23.Genome3
[47]   1400 CATATCCTC...AGCAAAAAA 24.Genome2
[48]   1400 CATATCATT...AGCCTAAAA 24.Genome3
[49]   1637 CGAATTAAA...AAAATAAAA 25.Genome2
[50]   1637 CAAAAAAAA...AAAATAAAA 25.Genome3