DECIPHER - Video #7

Community detection at unprecedented scales with ExoLabel

Originally presented at the ISMB 2025 conference in Liverpool, England.

Many approaches in comparative genomics rely on clusters of orthologous genes (COGs). Methods for constructing COGs often employ community detection algorithms to identify clusters within a network of pairwise similarities among genes. As the number of available genome sequences continues to grow exponentially, this community detection step has become the limiting factor for scaling COGs to more genomes, in terms of both memory and time. In this study, we developed ExoLabel, a community detection program that scales to enormous graphs by applying a linear-time algorithm to data stored outside of memory (i.e., on disk). We show that ExoLabel's accuracy rivals that of popular programs for identifying COGs, while it is orders of magnitude faster and more memory efficient than existing programs. We demonstrate ExoLabel's performance by clustering a graph with 16.2 million nodes (genes) and 18.3 billion edges (pairwise similarities) in less than a day using only a few gigabytes of RAM. ExoLabel democratizes comparative genomics in settings without access to supercomputers and scales COG detection to new heights.
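
To give a feel for how community detection can run over a graph kept on disk, here is a minimal Python sketch. It is not ExoLabel's implementation, and the description above does not say which algorithm ExoLabel uses. The sketch assumes label propagation, a well-known approach whose cost per pass is linear in the number of edges. The file layout (tab-separated source, target, weight rows, sorted by source, each edge listed in both directions) and the names propagate_labels and write_demo_edges are hypothetical, chosen only for illustration.

```python
import csv
import itertools
import tempfile


def propagate_labels(edge_path, max_passes=20):
    """Label propagation over an edge list streamed from disk.

    Assumed (hypothetical) input: tab-separated `source, target, weight`
    rows, sorted by source, with every undirected edge in both directions.
    """
    labels = {}  # node -> current label; the only per-node state kept in RAM
    for _pass in range(max_passes):
        changed = 0
        with open(edge_path, newline="") as handle:
            rows = csv.reader(handle, delimiter="\t")
            # Sorted input lets us see one node's edges at a time.
            for node, group in itertools.groupby(rows, key=lambda r: r[0]):
                tally = {}  # label -> summed edge weight for this node only
                for _src, neighbor, weight in group:
                    label = labels.get(neighbor, neighbor)
                    tally[label] = tally.get(label, 0.0) + float(weight)
                # Heaviest label wins; ties break toward the smaller label.
                best = min(tally, key=lambda lab: (-tally[lab], lab))
                if labels.get(node, node) != best:
                    changed += 1
                labels[node] = best
        if changed == 0:  # converged: no node switched label this pass
            break
    return labels


def write_demo_edges():
    """Write a toy graph (two triangles joined by a weak edge) to a temp file."""
    undirected = [
        ("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0),
        ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0),
        ("c", "d", 0.1),  # weak bridge between the two triangles
    ]
    both_ways = undirected + [(v, u, w) for u, v, w in undirected]
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".tsv", delete=False, newline=""
    ) as handle:
        csv.writer(handle, delimiter="\t").writerows(sorted(both_ways))
        return handle.name


if __name__ == "__main__":
    print(propagate_labels(write_demo_edges()))
```

In this sketch, memory grows with the number of nodes rather than the number of edges, because each pass reads the edge file sequentially and holds only one node's tally at a time. That is the general idea behind working on data outside of memory. A production tool would also need to handle sorting the edges on disk, oscillating labels, and far larger inputs, none of which is addressed here.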