Convergent evolution to similar proteins confounds structure search
Originally presented at the ISMB 2025 conference in Liverpool, England.
Advances in protein structure prediction and structural search tools (e.g., FoldSeek and PLMSearch) have enabled large-scale comparison of protein structures. It is now possible to quickly identify structurally similar proteins ("structurlogs"), but it remains unclear whether these similarities reflect homology (common ancestry) or analogy (convergent evolution). In this study, we found that ~2.6% of FoldSeek clusters lack sequence-level support for homology, including ~1% of matches with a high TM-score (≥ 0.5). This lack of sequence-level support could be due to extreme protein divergence or to independent evolution toward a similar structure. Here, we show that tandem repeats provide strong evidence for the presence of analogous protein structures. Our results suggest that analogs infiltrate structure search results and that care should be taken when relying on structural similarity alone to infer homology. This problem may extend beyond repeat proteins to other low-complexity folds, and structure search tools could be improved by masking these regions, as sequence search programs already do.
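To make the distinction between structural similarity and sequence-level support concrete, the following is a minimal sketch of how strong structure matches without sequence support might be flagged. It is an illustration only and not the procedure used in this study: the `Hit` record, the `candidate_analogs` function, and the E-value cutoff of 1e-3 are all hypothetical, and only the TM-score threshold of 0.5 is taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One structure-search match (hypothetical record, not a FoldSeek format)."""
    query: str
    target: str
    tm_score: float
    # E-value from a separate sequence search; None if no sequence hit was found.
    seq_evalue: Optional[float]


TM_THRESHOLD = 0.5     # high-TM-score threshold quoted in the abstract
EVALUE_CUTOFF = 1e-3   # assumed cutoff for "sequence-level support"


def candidate_analogs(
    hits: List[Hit],
    tm_threshold: float = TM_THRESHOLD,
    evalue_cutoff: float = EVALUE_CUTOFF,
) -> List[Hit]:
    """Return strong structure matches that lack sequence-level support."""
    flagged = []
    for hit in hits:
        strong_structure = hit.tm_score >= tm_threshold
        seq_supported = (
            hit.seq_evalue is not None and hit.seq_evalue <= evalue_cutoff
        )
        if strong_structure and not seq_supported:
            flagged.append(hit)
    return flagged
```

A match flagged this way is only a candidate: the filter cannot tell extreme divergence from convergence, which is why additional evidence such as tandem repeats is needed.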