DECIPHER - IDTAXA Classify Organisms FAQ

IDTAXA Classify Organisms - Frequently Asked Questions:

Where is IDTAXA described?
A Murali et al. (2018) "IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences." Microbiome, doi:10.1186/s40168-018-0521-5.
Can I use IDTAXA on my computer?
Yes. Please install DECIPHER and then see the code page.
What does "modified" mean?
1. IDTAXA can identify training sequences with a high probability of being mislabeled. Modified training sets have these sequences automatically removed when learning the classifier.
2. Modified sets are limited to 100 sequences per taxonomic group because this decreases the likelihood of including rare mislabeled sequences in very large groups.
3. In some cases there are modifications specific to a training set:
  - GTDB 16S: removal of very long sequences (> 2,000 nucleotides) and a few that don't remotely match any highly conserved 16S patterns.
  - GTDB/RDP 16S: amended with mitochondrial 16S, eukaryotic 18S, and PhiX sequence representatives. These are common contaminants found in 16S sequence sets.
  - SILVA 16S: removal of putative chimeras with Find Chimeras, sequences with ≥ 10 ambiguities, sequences classified to a non-basal taxon, and those that don't remotely match any highly conserved 16S patterns. To make the SILVA taxonomy compatible, basal groups marked as "uncultured" were removed, and a unique suffix was added to duplicated taxon names at the same rank level.
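
The filtering and capping steps described above can be illustrated with a minimal Python sketch. This is not DECIPHER's implementation: the function names, the `(sequence, taxon)` record layout, the random selection of which sequences to keep, and the fixed seed are all assumptions made for illustration. Only the thresholds (100 sequences per group, more than 2,000 nucleotides, 10 or more ambiguities) come from the text, and in practice the length and ambiguity rules apply to different training sets.

```python
import random
from collections import defaultdict

# IUPAC nucleotide ambiguity codes (anything other than A, C, G, T/U).
AMBIGUITY_CODES = set("NRYSWKMBDHV")


def count_ambiguities(seq):
    """Count ambiguous bases in a nucleotide sequence."""
    return sum(1 for base in seq.upper() if base in AMBIGUITY_CODES)


def filter_and_cap(records, max_per_group=100, max_length=2000,
                   min_ambiguities_to_drop=10, seed=0):
    """Hypothetical helper: filter sequences, then cap each taxonomic group.

    records is a list of (sequence, taxon) pairs. Sequences longer than
    max_length, or with at least min_ambiguities_to_drop ambiguous bases,
    are dropped. Each remaining group is limited to max_per_group members.
    Random subsampling is an assumption, not a documented DECIPHER behavior.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for seq, taxon in records:
        if len(seq) > max_length:
            continue  # very long sequence
        if count_ambiguities(seq) >= min_ambiguities_to_drop:
            continue  # too many ambiguous bases
        groups[taxon].append(seq)

    kept = []
    for taxon in sorted(groups):
        seqs = groups[taxon]
        if len(seqs) > max_per_group:
            seqs = rng.sample(seqs, max_per_group)  # subsample large groups
        kept.extend((seq, taxon) for seq in seqs)
    return kept
```

Calling `filter_and_cap(records)` on a list in which one genus has 150 members returns at most 100 records for that genus, while smaller groups pass through unchanged apart from the length and ambiguity filters.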
Can you perform species level classification?
It depends on the training set. For 16S training sets, classification goes only as far as the genus level because there are no species labels. The 16S gene is widely regarded as too conserved to provide species-level identification, even when the full-length sequence is available. This has been shown by a number of published studies, and we have confirmed it in our own work (see here). For example, strains with identical 16S sequences can have as little as 40% gene content similarity, making it impossible to ascertain anything that resembles a species-level classification even with full-length, error-free sequences.
Which training set is best for 16S sequences?
For 16S, we recommend the GTDB training set because it is the most recently published reference taxonomy. The SILVA training set has the most breadth, so it is likely to yield taxonomic names for the most sequences, although some names might be esoteric. Please be aware of the licensing information associated with use of the SILVA dataset. The Contax training set is based on agreement among multiple reference taxonomies, so it probably contains the least labeling error (cases where training sequences are misassigned) but also the least breadth.