Accurately clustering enormous numbers of sequences with Clusterize
Originally presented at the ISMB 2023 conference in Lyon, France. Exponential growth in the volume of biological sequences presents ever-increasing bioinformatics challenges. Clustering is often a first step in bioinformatics workflows, used to reduce sequences to a more manageable number. Clustering therefore becomes a scalability bottleneck constrained by time and memory limitations. Here, I describe the development of Clusterize, a novel method for accurate clustering with linear time and memory complexity. The Clusterize algorithm linearizes the clustering problem through a process termed relatedness sorting. After linear-time relatedness sorting, each sequence only needs to be compared to a fixed number of nearby sequences in the ordering, and it is clustered with a neighbor if the two are within the similarity threshold. I compare the performance of Clusterize with that of popular clustering programs, including CD-HIT, MMseqs, and UCLUST. Clusterize is able to quickly and accurately cluster tens of millions of homologous sequences, such as 16S amplicons, and non-homologous sequences, such as the UniProt database. Clusterize is far more accurate than another linear-time clustering algorithm, Linclust, on many typical clustering tasks. Overall, Clusterize represents a novel approach for scaling clustering to new heights while helping to cope with the continual flood of biological sequences. Clusterize is part of the DECIPHER package for R, available from the Bioconductor package repository.
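
The idea of comparing each sequence only to a fixed number of neighbors in a sorted ordering can be illustrated with a small Python sketch. This is not the Clusterize implementation and does not use the DECIPHER API. It rests on several assumptions made purely for illustration: a plain lexicographic sort stands in for relatedness sorting (and is O(n log n), not linear time), similarity is a Jaccard index over k-mer sets, and the names `kmer_set`, `similarity`, `cluster_sorted_window`, `window`, and `threshold` are hypothetical.

```python
from typing import List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (an illustrative stand-in measure)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        # Sequences shorter than k have no k-mers; fall back to exact match.
        return 1.0 if a == b else 0.0
    return len(ka & kb) / len(ka | kb)


def cluster_sorted_window(
    seqs: List[str], threshold: float = 0.5, window: int = 2, k: int = 3
) -> List[int]:
    """Assign cluster labels by comparing each sequence to at most `window`
    preceding sequences in a sorted ordering."""
    # ASSUMPTION: lexicographic order stands in for relatedness sorting.
    order = sorted(range(len(seqs)), key=lambda i: seqs[i])
    labels = [-1] * len(seqs)
    next_label = 0
    for pos, idx in enumerate(order):
        # Look back at the nearest neighbors first, never more than `window`.
        for prev_pos in range(pos - 1, max(pos - window, 0) - 1, -1):
            prev_idx = order[prev_pos]
            if similarity(seqs[idx], seqs[prev_idx], k) >= threshold:
                labels[idx] = labels[prev_idx]  # join the neighbor's cluster
                break
        if labels[idx] == -1:
            labels[idx] = next_label  # no close neighbor: start a new cluster
            next_label += 1
    return labels


if __name__ == "__main__":
    example = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
    print(cluster_sorted_window(example, threshold=0.5, window=2))
```

Because each sequence is compared to at most `window` neighbors, the comparison phase performs at most `n * window` similarity computations for `n` sequences, which is the property that keeps the post-sorting step linear. In the toy input, the two `ACGT...` sequences end up sharing one label and the two `TTTT...` sequences share another.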