Maximum Parsimony
This short example describes how to use Treeline to optimize maximum parsimony (MP) trees. The MP optimality criterion is fast, easy to interpret, and relies on a cost matrix for state changes.
Instructions
First, install DECIPHER and load the library in R. Next, provide Treeline with a sequence alignment and a cost matrix that will be used to optimize the tree.
> # load the DECIPHER library in R
> library(DECIPHER)
>
> # load the target sequences from a file
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> seqs <- readDNAStringSet(fas) # use AA, DNA, or RNA
> seqs
DNAStringSet object of length 317:
width seq names
[1] 819 ATGGCTT...AAGAAAA Rickettsia prowaz...
[2] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[3] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[4] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[5] 819 ATGGCTA...TGGTAAA Pasteurella multo...
... ... ...
[313] 819 ATGGCAA...TACTAAA Pectobacterium at...
[314] 822 ATGCCTA...CGTCAAG Acinetobacter sp....
[315] 864 ATGGGCA...TCAGTCT Thermosynechococc...
[316] 831 ATGGCAC...GAAGAAG Bradyrhizobium ja...
[317] 840 ATGGGCA...GCGAGGT Gloeobacter viola...
>
> # align coding sequences
> seqs <- AlignTranslation(seqs,
+ type="DNAStringSet") # choose AA or DNA
Determining distance matrix based on shared 5-mers:
|========================================| 100%
Time difference of 0.33 secs
Clustering into groups by similarity:
|========================================| 100%
Time difference of 0.02 secs
Aligning Sequences:
|========================================| 100%
Time difference of 0.47 secs
Iteration 1 of 2:
Determining distance matrix based on alignment:
|========================================| 100%
Time difference of 0.05 secs
Reclustering into groups by similarity:
|========================================| 100%
Time difference of 0.03 secs
Realigning Sequences:
|========================================| 100%
Time difference of 0.22 secs
Iteration 2 of 2:
Determining distance matrix based on alignment:
|========================================| 100%
Time difference of 0.05 secs
Reclustering into groups by similarity:
|========================================| 100%
Time difference of 0.02 secs
Realigning Sequences:
|========================================| 100%
Time difference of 0.03 secs
>
> # construct a cost matrix
> costMatrix <- 2*(1 - diag(4))
> colnames(costMatrix) <- DNA_BASES
> rownames(costMatrix) <- DNA_BASES
> costMatrix["A", "G"] <- 1
> costMatrix["G", "A"] <- 1
> costMatrix["C", "T"] <- 1
> costMatrix["T", "C"] <- 1
>
> # optimize the tree
> tree <- Treeline(seqs,
+ method="MP",
+ costMatrix=costMatrix, # cost of state changes
+ showPlot=TRUE,
+ processors=NULL) # use all CPUs
Optimizing up to 400 candidate trees:
Tree #145. score = 16263.000 (0.000%), 8 Climbs, 0 Grafts of 14
Finalizing the best tree (#79):
score = 16263.000 (0.000%), 0 Climbs
Time difference of 20.47 secs
>
> # optionally, output a Newick file
> WriteDendrogram(tree, file="")
((('Chlorobium tepidum TLS':0.1808308,('Geobacter sulfurreducens PCA':0.1699809,...
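To make the role of the cost matrix more concrete, the following is a minimal sketch of how a parsimony score can be computed for a single alignment column on a tiny tree, using the classic Sankoff recursion in base R. It is only an illustration and is not Treeline's implementation. The helper names leaf, join, and quartet are hypothetical, and the bases are typed out by hand so the example runs without DECIPHER.

```r
# Bases in the same order as the DNA_BASES constant used above
bases <- c("A", "C", "G", "T")

# Same cost matrix as in the session above:
# transitions (A<->G, C<->T) cost 1, transversions cost 2
costMatrix <- 2*(1 - diag(4))
dimnames(costMatrix) <- list(bases, bases)
costMatrix["A", "G"] <- costMatrix["G", "A"] <- 1
costMatrix["C", "T"] <- costMatrix["T", "C"] <- 1

# Cost vector for a leaf: 0 for the observed base, Inf otherwise
leaf <- function(base) {
  v <- setNames(rep(Inf, length(bases)), bases)
  v[base] <- 0
  v
}

# Sankoff recursion at an internal node: for each parent state,
# add the cheapest (change cost + subtree cost) over each child
join <- function(left, right, cost) {
  sapply(rownames(cost), function(s)
    min(cost[s, ] + left) + min(cost[s, ] + right))
}

# Score of the four-taxon tree ((a,b),(c,d)) at a single site
quartet <- function(a, b, c, d, cost) {
  min(join(join(leaf(a), leaf(b), cost),
           join(leaf(c), leaf(d), cost),
           cost))
}

quartet("A", "G", "C", "T", costMatrix) # 4: two transitions + one transversion
quartet("A", "C", "G", "T", costMatrix) # 5: a costlier grouping of the same bases
```

With this cost matrix, the tree that groups bases related by transitions receives the lower score, so it would be preferred under MP. Summing such per-site scores over all alignment columns gives a total tree score of the kind that an MP search tries to minimize.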