arXiv:2509.24186v2 Announce Type: replace-cross
Abstract: Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks — including expert preferences, holistic clinical tasks, safety judgments, and open-ended queries — achieving 4 wins, 0 losses, and 18% lower variance. As a substantive finding, topic-level competency profiles expose striking domain-specific heterogeneity that aggregate accuracy masks. As a diagnostic tool, difficulty-tier analysis reveals two distinct response profiles (difficulty-sensitive responding and difficulty-insensitive responding) that require fundamentally different interventions. These results establish item-aware psychometric evaluation as a more valid and stable foundation for assessing LLMs in medicine, with potential implications for any high-stakes domain where benchmark integrity can be validated, and items vary meaningfully in difficulty and discrimination.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844