DECIPHER - IDTAXA Classify Organisms Inputs

IDTAXA Classify Organisms - Inputs:

Training set:
Select the training set to use for classification. The training sets differ in their breadth and in the proportion of mislabeled sequences they contain. For 16S we recommend either the RDP training set or GTDB because both are derived from authoritative reference taxonomies. The RDP taxonomy originates from Bergey's Manual of Systematic Bacteriology, and the GTDB taxonomy is generated from a core-genome alignment. Training sets marked as "modified" have been slightly modified or amended. Please see the F.A.Q. page for more details.
Confidence level:
Select a minimum confidence threshold for classifications. We recommend using a confidence of 60% (very high) or 50% (high). Longer sequences are easier to classify because they contain more information, so a larger fraction of long sequences than of short sequences will be classified at the same confidence threshold.
The primary error mode of sequence classifiers is overclassification, where a sequence belonging to a novel group is assigned to an existing taxonomic group, and the overclassification rate is largely independent of sequence length. Therefore, it is not necessary to change the confidence threshold for shorter input sequences.
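To illustrate how a minimum confidence threshold affects a result, here is a minimal Python sketch. It is not part of DECIPHER. It assumes that each classification is a list of (taxon, confidence) pairs ordered from the root downward and that the reported lineage stops at the first rank whose confidence falls below the threshold. The function name truncate_lineage and the example values are hypothetical.

```python
def truncate_lineage(lineage, threshold=60.0):
    """Keep ranks from the root down until confidence first drops below threshold.

    lineage: list of (taxon, confidence_percent) pairs, root first (assumed layout).
    """
    kept = []
    for taxon, confidence in lineage:
        if confidence < threshold:
            break  # this rank and every deeper rank stay unclassified
        kept.append(taxon)
    return kept


# Hypothetical assignment for one sequence; the values are made up for illustration.
example = [
    ("Root", 98.0),
    ("Bacteria", 98.0),
    ("Firmicutes", 91.5),
    ("Bacilli", 84.0),
    ("Lactobacillales", 71.2),
    ("Streptococcaceae", 55.3),
    ("Streptococcus", 41.0),
]

at_60 = truncate_lineage(example, 60.0)  # stops at the order
at_50 = truncate_lineage(example, 50.0)  # reaches the family
```

In this sketch, lowering the threshold from 60% to 50% extends the reported lineage by one rank, at the cost of accepting a less certain assignment.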
FASTA File:
Choose a text file containing the sequence records that you wish to classify. An example input file containing 16S sequences belonging to 20 organisms from the Human Microbiome Project mock community can be downloaded here. Some general remarks about input files:
- Sequences must be in FASTA format where each new sequence record begins with a ">" symbol on a single line containing the description, and subsequent lines contain the actual sequence (there are no restrictions on the number of nucleotides per line).
- Sequences can be aligned (with gaps) or unaligned (without gaps). DECIPHER will remove gaps from a sequence before running the IDTAXA algorithm.
- The uploaded file must be smaller than 100 MB. This limit corresponds to roughly 75,000 nearly full-length (≥ 1,200 nucleotide) 16S sequences without gaps, but gaps in alignment files can reduce this number significantly. Therefore, we recommend that users upload unaligned sequences to maximize the number of sequences that can be classified at a time.
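
The remarks above can be sketched in Python as a pre-upload step: parse the FASTA records, remove the gaps, and check the resulting file against the size limit. This is an illustrative sketch, not DECIPHER code. The function names are hypothetical, the gap characters are assumed to be "-" and ".", and "100 MB" is assumed to mean 100,000,000 bytes.

```python
import os

MAX_UPLOAD_BYTES = 100 * 1000 * 1000  # assumption: "100 MB" means 10^8 bytes


def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    description, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # a sequence may span any number of lines
    if description is not None:
        yield description, "".join(chunks)


def strip_gaps(sequence):
    """Remove the assumed gap characters '-' and '.' from a sequence."""
    return sequence.replace("-", "").replace(".", "")


def degap_fasta(in_path, out_path):
    """Write an unaligned copy of in_path; return True if it is under the limit."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for description, sequence in read_fasta(src):
            dst.write(">" + description + "\n" + strip_gaps(sequence) + "\n")
    return os.path.getsize(out_path) < MAX_UPLOAD_BYTES
```

For example, degap_fasta("aligned.fasta", "unaligned.fasta") would write the gap-free records to a new file and report whether that file is small enough to upload. Both file names are placeholders.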