Predicting host-pathogen interactions using a proteome-scale language model

ProteomeLM is a proteome-scale language model trained on proteomes spanning the tree of life. Within each species, it reconstructs masked protein embeddings from their proteome context, and its attention coefficients capture protein-protein interactions without supervision. Here, we show that this capability extends to cross-species host-pathogen interactions (HPI) across ten human pathogen taxa spanning viruses and bacteria, and that lightweight fine-tuning improves it further. We introduce ProteomeLM-HPI, a parameter-efficient adaptation via LoRA that is trained on concatenated host-pathogen proteomes to reconstruct masked pathogen embeddings from host context. ProteomeLM-HPI rests on two key design choices: asymmetric (pathogen-heavy) masking and blocked self-attention. Systematic ablations show that both choices contribute. To assess generalization, we introduce a strict cross-species benchmark that enforces pathogen-level holdout and 40% sequence-identity filtering. On this benchmark, ProteomeLM-HPI improves AUC on 9 of 10 unseen pathogens.
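To make two of the ideas above concrete, the following is a minimal sketch of pathogen-heavy masking and of a pathogen-level holdout split. It is an illustration under stated assumptions, not the authors' implementation: the function names, the masking probabilities (`p_pathogen`, `p_host`), and the tuple layout of an interaction record are all hypothetical, and the 40% sequence-identity filtering step is not shown.

```python
import random


def asymmetric_mask(is_pathogen, p_pathogen=0.5, p_host=0.05, seed=0):
    """Choose which protein embeddings to mask in a concatenated proteome.

    is_pathogen: list of booleans, one per protein (True = pathogen protein).
    p_pathogen / p_host: assumed masking probabilities (illustrative values,
    not taken from the source); pathogen positions are masked more often.
    Returns a list of booleans, True where the embedding would be masked.
    """
    rng = random.Random(seed)
    return [
        rng.random() < (p_pathogen if flag else p_host)
        for flag in is_pathogen
    ]


def leave_one_pathogen_out(pairs):
    """Yield (held_out_taxon, train, test) splits with pathogen-level holdout.

    pairs: list of (host_protein, pathogen_protein, pathogen_taxon, label)
    tuples; this record layout is a hypothetical choice for the sketch.
    Every pair involving the held-out taxon goes to the test set, so the
    model never sees that pathogen during training.
    """
    taxa = sorted({taxon for _, _, taxon, _ in pairs})
    for held_out in taxa:
        train = [p for p in pairs if p[2] != held_out]
        test = [p for p in pairs if p[2] == held_out]
        yield held_out, train, test
```

Keeping the split at the taxon level, rather than at the level of individual protein pairs, is what makes the evaluation a test on unseen pathogens: no interaction from the held-out taxon can appear in training.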


Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844