IDTAXA Classify Organisms - Inputs:
Please select the training set that you wish to use for classification. The training sets differ in their breadth as well as in the proportion of mislabeled sequences they contain. For 16S we recommend either the RDP training set or GTDB because these are derived from authoritative reference taxonomies. The RDP taxonomy originates from Bergey's Manual of Systematic Bacteriology and the GTDB taxonomy is generated from a core-genome alignment. The training sets marked as "modified" have been slightly modified or amended. Please see the F.A.Q. page for more details.
Select a minimum confidence threshold for classifications. We recommend using a confidence of 60% (very high) or 50% (high). Longer sequences are easier to classify because they contain more information, so a larger fraction of longer sequences will be classified at a given confidence threshold.
The primary error mode of sequence classifiers is overclassification, where a sequence belonging to a novel group is assigned to an existing taxonomic group, and the overclassification rate is largely independent of sequence length. Therefore, it is not necessary to change the confidence threshold for shorter input sequences.
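To illustrate what a minimum confidence threshold does, here is a minimal Python sketch. It is not the IDTAXA implementation. It assumes each classification is a lineage of (taxon, confidence) pairs ordered from the root downward, and that confidence does not increase toward lower ranks. The function name truncate_lineage and the example lineage with its confidence values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Lineage = List[Tuple[str, float]]


def truncate_lineage(lineage: Lineage, threshold: float) -> Lineage:
    """Keep ranks from the root down until confidence falls below threshold.

    Hypothetical helper: assumes `lineage` is ordered root-first and that
    confidences are percentages between 0 and 100.
    """
    kept: Lineage = []
    for taxon, confidence in lineage:
        if confidence < threshold:
            break  # this rank and everything below it is left unclassified
        kept.append((taxon, confidence))
    return kept


# Made-up lineage and confidences, for illustration only.
example: Lineage = [
    ("Root", 100.0),
    ("Bacteria", 99.2),
    ("Firmicutes", 91.5),
    ("Bacilli", 72.0),
    ("Lactobacillales", 55.3),
    ("Streptococcaceae", 41.0),
]

for threshold in (50.0, 60.0):
    names = [taxon for taxon, _ in truncate_lineage(example, threshold)]
    print(f"threshold {threshold:.0f}%: " + "; ".join(names))
```

In this made-up example, the 60% threshold stops the assignment one rank higher than the 50% threshold does. A stricter threshold therefore trades depth of classification for fewer overclassifications.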
Choose a text file containing the sequence records that you wish to classify. An example input file containing 16S sequences belonging to 20 organisms from the Human Microbiome Project mock community can be downloaded here. Some general remarks about input files: