Background: Large language models (LLMs) have demonstrated strong performance on standardized medical benchmarks. However, their potential in complex surgical decision-making is largely uncharacterized. Critically, human–LLM collaboration, specifically the extent to which clinicians can effectively recognize and integrate model-generated reasoning, remains an unaddressed question. To address these gaps, we developed a two-phase evaluation framework to simultaneously assess LLM performance and human–LLM collaboration in cardiac surgery.

Methods: A panel of senior cardiac surgeons independently developed 15 high-fidelity cardiac surgery scenarios, each paired with a clinically relevant open-ended reasoning task, expert-curated reference answers, and a 10-dimensional weighted evaluation framework. Five representative LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B) were prompted using a multi-agent strategy. A separate group of senior surgeons conducted a blinded two-phase evaluation to assess model performance and shifts in evaluator judgment: in the first round, they rated the LLMs independently; in the second, they were shown the reference answers and invited to revise their ratings, with changes being optional.

Results: LLM performance varied across scenarios, but relative rankings remained stable. Median normalized scores were highest for O1 (0.896), followed by O3-mini-high (0.854), DeepSeek-R1 (0.792), GPT-4 (0.667), and Llama3-OpenBioLLM-70B (0.521). Across evaluation dimensions, scenario comprehension scored highest (0.920), while patient safety (0.507), hallucination avoidance (0.549), and clinical efficiency (0.597) scored lowest across models. Second-round normalized scores declined for four LLMs, with 7.57% of ratings revised from affirmative to negative and only 2.59% from negative to affirmative. Among the five highest-weighted evaluation dimensions, 10.16% of second-round ratings were revised from affirmative to negative.

Conclusions: Reasoning-optimized LLMs outperformed all other models. However, all models exhibited clinical limitations, including poor performance in core evaluation dimensions and in scenarios requiring complex, longitudinal reasoning. Overacceptance was the dominant collaboration imbalance: clinicians over-accepted model reasoning that appeared clinically sound yet was incorrect or potentially harmful. These findings suggest that these LLMs are not yet ready for safe use in complex surgical settings due to both performance limitations and human–LLM collaboration imbalance.
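As an illustration of how directional revision rates such as those reported above might be tallied, the following is a minimal sketch and not the study's actual analysis code. It assumes each rating is binary (affirmative or negative) and that the denominator is the full set of paired ratings; the function name `revision_rates` and the toy data are hypothetical.

```python
from collections import Counter


def revision_rates(first_round, second_round):
    """Return the share of ratings flipped in each direction between rounds.

    Assumption: ratings are booleans (True = affirmative, False = negative)
    and the two lists are aligned item by item.
    """
    if len(first_round) != len(second_round):
        raise ValueError("rounds must contain the same number of ratings")
    if not first_round:
        raise ValueError("no ratings supplied")

    # Count each (first-round, second-round) pair of ratings.
    flips = Counter(zip(first_round, second_round))
    total = len(first_round)
    return {
        "affirmative_to_negative": flips[(True, False)] / total,
        "negative_to_affirmative": flips[(False, True)] / total,
    }


# Toy data for illustration only; not taken from the study.
first = [True, True, True, False, False, True, True, False, True, True]
second = [True, False, True, False, True, True, False, False, True, True]
rates = revision_rates(first, second)
print(rates)
```

A larger affirmative-to-negative share than negative-to-affirmative share would correspond to the overacceptance pattern described in the conclusions.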
Digital first primary care in NHS England: evaluating alignment with patient-centered care and implications for future practice
The Digital First Primary Care (DFPC) model, introduced by NHS England, aims to enhance healthcare accessibility and efficiency by leveraging digital tools such as telemedicine,

