Predicting gene functional associations from coevolutionary signals with EvoWeaver
Originally presented at the ISMB 2024 conference in Montreal, Canada.
The universe of uncharacterized proteins is expanding far faster than our ability to annotate their functions through laboratory study. Computational annotation approaches rely on similarity to previously studied proteins, thereby ignoring unstudied proteins. This gives rise to a "rich get richer" scenario in which the majority of research focuses on a small subset of proteins. Coevolutionary approaches hold promise for injecting new information into our knowledge of the protein universe by linking proteins through "guilt-by-association". However, existing coevolutionary algorithms have insufficient accuracy and scalability to connect the entire universe of proteins. We present EvoWeaver, an algorithm that weaves together 12 distinct signals of coevolution to quantify the degree of shared evolution between genes. EvoWeaver's signals encompass phylogenetic profiling, phylogenetic structure, gene organization, and sequence-level methods that broadly capture coevolution between sequences. EvoWeaver accurately identifies proteins involved in protein complexes or separate steps of a biochemical pathway. We demonstrate the merits of EvoWeaver by partly reconstructing known biochemical pathways without any prior knowledge other than genome sequences. Additionally, we show that EvoWeaver's predictions rival those of the widely used STRING database without reliance on prior biological knowledge. Finally, we leverage EvoWeaver's predictions to uncover experimentally validated functional associations among genes that are absent from existing databases. This work is one of the largest-scale analyses of protein functional relationships to date, encompassing 1,545 gene groups from 8,564 genomes. Given its predictive power and speed, EvoWeaver has the potential to revolutionize protein functional prediction at the scale of the protein universe.
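To make the idea of a coevolutionary signal concrete, the following is a minimal sketch of phylogenetic profiling, the first of the signal families named above. It is not EvoWeaver's implementation: the presence/absence data are invented, the names `profiles`, `jaccard`, and `rank_pairs` are hypothetical, and the Jaccard index stands in as one simple way to compare profiles. The intuition is that gene groups gained and lost together across genomes are candidates for a functional association.

```python
from itertools import combinations

# Hypothetical toy data (invented for illustration): one row per gene group,
# one column per genome. 1 = gene group present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0, 1, 0],
    "geneC": [1, 0, 0, 1, 1, 0, 1, 0],
    "geneD": [0, 0, 1, 1, 0, 1, 0, 1],
}


def jaccard(p, q):
    """Jaccard index of two binary profiles: shared presences / total presences."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Score every pair of gene groups and sort from most to least similar."""
    scores = {
        (g1, g2): jaccard(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for (g1, g2), score in rank_pairs(profiles):
        print(f"{g1} - {g2}: {score:.2f}")
```

In this toy example, `geneA` and `geneB` share an identical profile and rank first, while `geneA` and `geneD` never co-occur and score zero. A raw presence/absence comparison like this ignores the shared ancestry of the genomes, which is one reason a single signal of this kind is a weaker predictor than a combination of several.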