DECIPHER - Bioinformatics Video - Accurately clustering with Clusterize

Accurately clustering enormous numbers of sequences with Clusterize

Originally presented at the ISMB 2023 conference in Lyon, France.

Exponential growth in the volume of biological sequences presents ever-increasing bioinformatics challenges. Clustering is often a first step in bioinformatics workflows, used to reduce sequences to more manageable numbers. As a result, clustering itself becomes a scalability bottleneck, constrained by time and memory limitations. Here, I describe the development of Clusterize, a novel method for accurately clustering with linear time and memory complexity. The Clusterize algorithm linearizes the clustering problem through a process termed relatedness sorting. After linear-time relatedness sorting, each sequence only needs to be compared to a fixed number of nearby sequences in the ordering, and it is clustered with one of them if it falls within the similarity threshold. I compare the performance of Clusterize to that of popular clustering programs, including CD-HIT, MMseqs, and UCLUST. Clusterize can quickly and accurately cluster tens of millions of homologous sequences, such as 16S amplicons, as well as non-homologous sequences, such as those in the UniProt database. Clusterize is far more accurate than another linear-time clustering algorithm, Linclust, on many typical clustering tasks. Overall, Clusterize represents a novel approach for scaling clustering to new heights while keeping pace with the continual flood of biological sequences. Clusterize is part of the DECIPHER package for R, available from the Bioconductor package repository.
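The idea of comparing each sequence only to a fixed number of neighbors in a sorted ordering can be illustrated with a small toy sketch in Python. This is not the Clusterize implementation and does not use the DECIPHER API. The function names (`kmers`, `jaccard`, `windowed_cluster`), the k-mer Jaccard similarity, and the parameters `threshold`, `window`, and `k` are all assumptions made for illustration. In particular, a plain lexicographic sort stands in for relatedness sorting, which it only loosely resembles and which is not linear-time.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def windowed_cluster(seqs, threshold=0.7, window=2, k=3):
    """Cluster seqs by comparing each one to at most `window` predecessors.

    Returns (labels, comparisons), where labels[i] is the cluster of seqs[i].
    """
    # Stand-in ordering: lexicographic sort (hypothetical proxy only,
    # NOT the relatedness sorting used by Clusterize).
    order = sorted(range(len(seqs)), key=lambda i: seqs[i])
    profiles = [kmers(s, k) for s in seqs]
    labels = [None] * len(seqs)
    next_label = 0
    comparisons = 0
    for pos, i in enumerate(order):
        # Look only at the `window` sequences just before this one,
        # nearest neighbor first.
        for j in reversed(order[max(0, pos - window):pos]):
            comparisons += 1
            if jaccard(profiles[i], profiles[j]) >= threshold:
                labels[i] = labels[j]  # join the neighbor's cluster
                break
        if labels[i] is None:  # no neighbor was similar enough
            labels[i] = next_label
            next_label += 1
    return labels, comparisons


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCTTGG", "TTGGCCTTGC", "ACGTACGTAG"]
labels, comparisons = windowed_cluster(seqs, threshold=0.7, window=2)
print(labels)       # [0, 0, 1, 1, 0]
print(comparisons)  # 5
```

Because each sequence is compared to at most `window` others, the number of comparisons is bounded by the number of sequences times `window`, rather than growing quadratically as in all-against-all clustering. The quality of the result depends entirely on how well the ordering places related sequences near each other, which is the role the abstract assigns to relatedness sorting.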