Antibodies are a leading class of biologics, yet their architecture of conserved framework regions and hypervariable complementarity-determining regions (CDRs) poses unique challenges for computational modeling. We present a region-aware pretraining strategy for paired heavy-chain (VH) and light-chain (VL) variable-domain sequences using the ESM2-3B and ESM C (600M) protein language models. We compare three masking strategies: whole-chain, CDR-focused, and a hybrid approach. Through evaluation on binding affinity datasets spanning single-mutant panels and combinatorial mutants, we demonstrate that CDR-focused training produces superior embeddings for functional prediction. Notably, training only on VH-VL pairs proves sufficient, eliminating the need for massive unpaired pretraining, which provides no measurable downstream benefit. Our compact 600M ESM C model achieves state-of-the-art performance, matching or exceeding larger antibody-specific baselines. These findings establish a principled framework for antibody language models: prioritize paired sequences with CDR-aware supervision over scale and complex training curricula to achieve both computational efficiency and predictive accuracy.
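The CDR-focused strategy restricts masked-language-model supervision to CDR positions rather than sampling masks uniformly over the whole chain. A minimal sketch of how such masking could be implemented is shown below; the function name, the (start, end) CDR span format, and the 15% masking rate are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of CDR-focused masking for masked-language-model pretraining.
# Assumptions: token_ids are integer token IDs for a paired VH-VL sequence,
# mask_token_id is the model's mask token, and cdr_spans holds (start, end)
# index ranges of the CDRs (the numbering scheme, e.g. IMGT, is assumed).
import random
from typing import List, Tuple


def cdr_focused_mask(
    token_ids: List[int],
    cdr_spans: List[Tuple[int, int]],
    mask_token_id: int,
    mask_prob: float = 0.15,
    ignore_index: int = -100,
) -> Tuple[List[int], List[int]]:
    """Return (masked_inputs, labels); only CDR positions are eligible for masking."""
    inputs = list(token_ids)
    labels = [ignore_index] * len(token_ids)  # loss is computed only at masked positions
    cdr_positions = [i for start, end in cdr_spans for i in range(start, end)]
    for i in cdr_positions:
        if random.random() < mask_prob:
            labels[i] = token_ids[i]   # keep the true residue token as the target
            inputs[i] = mask_token_id  # replace it with the mask token in the input
    return inputs, labels
```

Under this assumed setup, the whole-chain and hybrid strategies would differ only in which positions are eligible for masking (all positions, or a mixture of CDR and framework positions).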
Uncovering Code Insights: Leveraging GitHub Artifacts for Deeper Code Understanding
arXiv:2511.03549v1 Abstract: Understanding the purpose of source code is a critical task in software maintenance, onboarding, and modernization. While large language models


