DECIPHER - Maximum Parsimony

Maximum Parsimony

This short example describes how to use Treeline to optimize maximum parsimony (MP) trees. The MP optimality criterion is fast to compute and easy to interpret: it scores a tree by the total cost of the state changes the tree requires, with the cost of each change taken from a cost matrix.

For an in-depth tutorial on phylogenetics, see the "Growing Phylogenetic Trees with Treeline" vignette, available from the Documentation page.
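To make the role of the cost matrix concrete, the following base R sketch scores a single alignment column on two fixed four-tip trees using the Sankoff dynamic-programming algorithm. It illustrates the parsimony criterion in general and is not Treeline's internal implementation; the nested-list tree encoding and the names `sankoff`, `tree_1`, and `tree_2` are assumptions made only for this example.

```r
# Sankoff parsimony score for one alignment column on a fixed tree.
# Illustration only: the tree encoding and function name are hypothetical.
bases <- c("A", "C", "G", "T")

# transitions (A<->G, C<->T) cost 1; transversions cost 2
cost <- 2 * (1 - diag(4))
dimnames(cost) <- list(bases, bases)
cost["A", "G"] <- cost["G", "A"] <- 1
cost["C", "T"] <- cost["T", "C"] <- 1

# a tree is a nested list; a tip is a single character state
sankoff <- function(node) {
  if (is.character(node)) {
    # tip: zero cost for the observed state, infinite cost otherwise
    v <- setNames(rep(Inf, 4), bases)
    v[node] <- 0
    return(v)
  }
  # internal node: for each parent state, add the cheapest
  # change (or no change) leading to each child
  total <- setNames(numeric(4), bases)
  for (child in node) {
    v <- sankoff(child)
    total <- total + apply(cost, 1, function(row) min(row + v))
  }
  total
}

# the same four tip states arranged on two different topologies
tree_1 <- list(list("A", "G"), list("C", "T"))
tree_2 <- list(list("A", "C"), list("G", "T"))

score_1 <- min(sankoff(tree_1)) # 4: two transitions and one transversion
score_2 <- min(sankoff(tree_2)) # 5: two transversions and one transition
```

Under this cost matrix the first topology is more parsimonious because it groups the tips that differ only by a transition. A full MP search sums such per-site scores over the alignment and looks for the topology with the lowest total.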

How do I build a maximum parsimony phylogenetic tree?

First, install DECIPHER and load the library in R. Next, provide Treeline with a sequence alignment and the cost matrix that will be used to optimize the tree.


> # load the DECIPHER library in R
> library(DECIPHER)
> 
> # load the target sequences from a file
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> seqs <- readDNAStringSet(fas) # use AA, DNA, or RNA
> seqs
DNAStringSet object of length 317:
      width seq               names               
  [1]   819 ATGGCTT...AAGAAAA Rickettsia prowaz...
  [2]   822 ATGGGAA...GAAAAAG Porphyromonas gin...
  [3]   822 ATGGGAA...GAAAAAG Porphyromonas gin...
  [4]   822 ATGGGAA...GAAAAAG Porphyromonas gin...
  [5]   819 ATGGCTA...TGGTAAA Pasteurella multo...
  ...   ... ...
[313]   819 ATGGCAA...TACTAAA Pectobacterium at...
[314]   822 ATGCCTA...CGTCAAG Acinetobacter sp....
[315]   864 ATGGGCA...TCAGTCT Thermosynechococc...
[316]   831 ATGGCAC...GAAGAAG Bradyrhizobium ja...
[317]   840 ATGGGCA...GCGAGGT Gloeobacter viola...
> 
> # align coding sequences
> seqs <- AlignTranslation(seqs,
+ type="DNAStringSet") # choose AA or DNA
Determining distance matrix based on shared 5-mers:
  |========================================| 100%


Time difference of 0.33 secs


Clustering into groups by similarity:
  |========================================| 100%


Time difference of 0.02 secs


Aligning Sequences:
  |========================================| 100%


Time difference of 0.47 secs


Iteration 1 of 2:


Determining distance matrix based on alignment:
  |========================================| 100%


Time difference of 0.05 secs


Reclustering into groups by similarity:
  |========================================| 100%


Time difference of 0.03 secs


Realigning Sequences:
  |========================================| 100%


Time difference of 0.22 secs


Iteration 2 of 2:


Determining distance matrix based on alignment:
  |========================================| 100%


Time difference of 0.05 secs


Reclustering into groups by similarity:
  |========================================| 100%


Time difference of 0.02 secs


Realigning Sequences:
  |========================================| 100%


Time difference of 0.03 secs


> 
> # construct a cost matrix
> costMatrix <- 2*(1 - diag(4))
> colnames(costMatrix) <- DNA_BASES
> rownames(costMatrix) <- DNA_BASES
> costMatrix["A", "G"] <- 1
> costMatrix["G", "A"] <- 1
> costMatrix["C", "T"] <- 1
> costMatrix["T", "C"] <- 1
> 
> # optimize the tree
> tree <- Treeline(seqs,
+ method="MP",
+ costMatrix=costMatrix, # costs of state changes
+ showPlot=TRUE,
+ processors=NULL) # use all CPUs
Optimizing up to 400 candidate trees:
Tree #145. score = 16263.000 (0.000%), 8 Climbs, 0 Grafts of 14   


Finalizing the best tree (#79):
score = 16263.000 (0.000%), 0 Climbs             


Time difference of 20.47 secs


> 
> # optionally, output a Newick file
> WriteDendrogram(tree, file="")
((('Chlorobium tepidum TLS':0.1808308,('Geobacter sulfurreducens PCA':0.1699809,...
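Because the result is passed to WriteDendrogram, the tree is a dendrogram object, so base R's dendrogram tools can be used to inspect it. The sketch below uses a small stand-in dendrogram built with hclust from toy data, so that it runs without DECIPHER; the object `stand_in` and its tip names are hypothetical, and in practice the same calls would be applied to `tree`.

```r
# Stand-in dendrogram from toy data (NOT a Treeline result);
# in practice, apply the same calls to the tree returned by Treeline.
toy <- matrix(c(0, 1, 5, 6), ncol=1,
	dimnames=list(c("a", "b", "c", "d"), NULL))
stand_in <- as.dendrogram(hclust(dist(toy)))

tip_names <- labels(stand_in) # tip labels in plotting order
n_tips <- attr(stand_in, "members") # number of tips below the root
root_height <- attr(stand_in, "height") # height of the root node

# compact text summary of the tree structure
str(stand_in)
```

To save the Newick string to disk rather than print it, give WriteDendrogram a file path in place of the empty string.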