Identifying more protein paralogs is important for fully understanding protein relationships, yet it remains challenging for sequences in the twilight zone of low sequence identity. Here, we present an integrated homolog detection framework that combines sequence-based (BLASTp, MMseqs2), structure-based (Foldseek), and embedding-distance-based (PROST) similarity metrics to identify additional paralogs. To characterize functionally related protein pairs, we develop protein-family-specific supervised logistic regression models trained on curated functional annotations from MEROPS proteases and KinHub kinases. The resulting models classify proteins with a ROC-AUC of 0.99 and an F1-score of 0.92 on test datasets. Applying these models, we initially identify 686 new protease and 298 new kinase candidates. Subsequent structural validation and comparison with previous annotations yield 7 new protease and 3 new kinase paralogs in the human proteome, most of which lack prior functional characterization. An additional outcome is the structure-based identification of catalytically important residues for a larger number of proteases and kinases. Although the number of new paralogs is small for families as well studied as proteases and kinases, our results demonstrate that integrating orthogonal homolog detection approaches with family-specific regression models provides a robust, scalable strategy for discovering new functionally related proteins. The approach is generalizable to novel protein function discovery and can be applied more broadly to under-annotated proteomes.
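
As a rough illustration of the classification step, the sketch below fits a logistic regression on four similarity scores per protein pair and reports F1 and ROC-AUC on a held-out set. It is not the authors' pipeline: the data are synthetic, the scores are assumed to be pre-scaled to [0, 1], no BLASTp, MMseqs2, Foldseek, or PROST calls are made, and every name in the code (`FEATURES`, `make_synthetic_pairs`, `fit_logistic`, and so on) is hypothetical.

```python
import math
import random

# Hypothetical feature order: one similarity score per method, assumed scaled to [0, 1].
FEATURES = ("blastp", "mmseqs2", "foldseek", "prost")

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_synthetic_pairs(n, rng):
    """Toy protein pairs: related pairs (label 1) score higher on average."""
    rows, labels = [], []
    for _ in range(n):
        related = rng.random() < 0.5
        center = 0.7 if related else 0.3
        rows.append([min(1.0, max(0.0, rng.gauss(center, 0.15))) for _ in FEATURES])
        labels.append(1 if related else 0)
    return rows, labels

def predict_proba(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_logistic(rows, labels, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log-loss (no regularization)."""
    w, b, n = [0.0] * len(FEATURES), 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(w, b, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def f1_score(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)
train_x, train_y = make_synthetic_pairs(400, rng)
test_x, test_y = make_synthetic_pairs(200, rng)

w, b = fit_logistic(train_x, train_y)
test_scores = [predict_proba(w, b, x) for x in test_x]
test_f1 = f1_score(test_y, [1 if s >= 0.5 else 0 for s in test_scores])
test_auc = roc_auc(test_y, test_scores)
print(f"F1={test_f1:.3f}  ROC-AUC={test_auc:.3f}")
```

With real data, each row would hold the scores the four methods assign to one protein pair, and a separate model would be fitted per protein family. The synthetic classes here are deliberately well separated, so the printed metrics say nothing about performance on real proteins.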
