Maximum Likelihood
This short example describes how to use Treeline to optimize maximum likelihood (ML) trees. The ML optimality criterion is grounded in the theory of molecular evolution and popular, although it is relatively slow.Instructions
First it is necessary to install DECIPHER and load the library in R. Next, provide Treeline with a sequence alignment and model of evolution that will be used to optimize the tree.# load the DECIPHER library in R
> library(DECIPHER)
>
> # load the target sequences from a file
> fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"
> seqs <- readDNAStringSet(fas) # use AA, DNA, or RNA
> seqs
DNAStringSet object of length 317:
width seq names
[1] 819 ATGGCTT...AAGAAAA Rickettsia prowaz...
[2] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[3] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[4] 822 ATGGGAA...GAAAAAG Porphyromonas gin...
[5] 819 ATGGCTA...TGGTAAA Pasteurella multo...
... ... ...
[313] 819 ATGGCAA...TACTAAA Pectobacterium at...
[314] 822 ATGCCTA...CGTCAAG Acinetobacter sp....
[315] 864 ATGGGCA...TCAGTCT Thermosynechococc...
[316] 831 ATGGCAC...GAAGAAG Bradyrhizobium ja...
[317] 840 ATGGGCA...GCGAGGT Gloeobacter viola...
>
> # align coding sequences
> seqs <- AlignTranslation(seqs,
+ type="DNAStringSet") # choose AA or DNA
Determining distance matrix based on shared 5-mers:
|========================================| 100%
Time difference of 0.32 secs
Clustering into groups by similarity:
|========================================| 100%
Time difference of 0.02 secs
Aligning Sequences:
|========================================| 100%
Time difference of 0.49 secs
Iteration 1 of 2:
Determining distance matrix based on alignment:
|========================================| 100%
Time difference of 0.05 secs
Reclustering into groups by similarity:
|========================================| 100%
Time difference of 0.03 secs
Realigning Sequences:
|========================================| 100%
Time difference of 0.22 secs
Iteration 2 of 2:
Determining distance matrix based on alignment:
|========================================| 100%
Time difference of 0.05 secs
Reclustering into groups by similarity:
|========================================| 100%
Time difference of 0.02 secs
Realigning Sequences:
|========================================| 100%
Time difference of 0.03 secs
>
> # optimize the tree
> tree <- Treeline(seqs,
+ method="ML",
+ model=MODELS, # choose a model or test all
+ showPlot=TRUE,
+ processors=NULL) # use all CPUs
Fitting initial tree to models:
JC69 -ln(L) = 71307, AICc = 147914, BIC = 146854
K80 -ln(L) = 70693, AICc = 146721, BIC = 145633
F81 -ln(L) = 71270, AICc = 147947, BIC = 146800
HKY85 -ln(L) = 70650, AICc = 146745, BIC = 145568
T92 -ln(L) = 70682, AICc = 146736, BIC = 145619
TN93 -ln(L) = 70308, AICc = 146096, BIC = 144890
SYM -ln(L) = 69984, AICc = 145449, BIC = 144242
GTR -ln(L) = 69699, AICc = 144991, BIC = 143692
The selected model was: GTR
Optimizing up to 400 candidate trees:
Tree #133. -ln(L) = 69081.540 (0.000%), 13 Climbs, 0 Grafts of 12
Finalizing the best tree (#90):
-ln(L) = 69081.539 (0.000%), 0 Climbs
Model parameters:
Frequency(A) = 0.253
Frequency(C) = 0.227
Frequency(G) = 0.369
Frequency(T) = 0.151
Rate A <-> C = 1.760
Rate A <-> G = 2.133
Rate A <-> T = 3.914
Rate C <-> G = 1.107
Rate C <-> T = 5.616
Rate G <-> T = 1.000
Time difference of 307.1 secs
>
> # optionally, output a Newick file
> WriteDendrogram(tree, file="")
((((('Thermoanaerobacter tengcongensis MB4':0.2144063,('Clostridium acetobutylicum ATCC 824':0.1143762,...