Musculoskeletal pain, especially low-back pain, is highly prevalent and often challenging to manage due to its multifactorial nature. Effective diagnosis and therapy require clinicians to integrate biopsychosocial information within an evidence-based clinical reasoning framework. Large language models that “think” before responding, so-called reasoning models, show promise for supporting such complex decision-making, yet their validity and reliability in this setting remain unclear. In our work, we present a comprehensive human evaluation of reasoning models for clinical reasoning. Our results indicate that state-of-the-art reasoning models demonstrate sufficient test–retest reliability and are competent or proficient in terms of their conceptual reasoning, completeness, correctness, relevance, and usefulness, with no statistically significant or clinically relevant differences between them. However, our qualitative analysis reveals weaknesses in logical coherence, patient-centeredness, empathy, and intuition, with most deviations from expert reasoning occurring in the domain of intuition. Our findings underscore the importance of adopting a multidimensional framework for evaluating language model outputs and allow us to provide guidance on model selection and prompting strategies to enhance clinical reasoning performance.
Development of interpretable machine learning models for classification of pancreatic pseudocyst risk in acute pancreatitis
Introduction
Pancreatic pseudocysts (PPC) are a late local complication of acute pancreatitis (AP). Persistent PPC carry a high risk of severe outcomes. Existing models, which are

