Digital first primary care in NHS England: evaluating alignment with patient-centered care and implications for future practice

The Digital First Primary Care (DFPC) model, introduced by NHS England, aims to enhance healthcare accessibility and efficiency by leveraging digital tools such as telemedicine,

The economics of digitally integrated wellness services in heritage regions

Wellness tourism is among the fastest-growing segments of the global health economy, yet its development in Central Asian heritage regions remains constrained by fragmented service

Screening anxiety via contrastive autobiographical recall

Introduction: Language offers a low-burden and scalable pathway for digital anxiety screening, particularly in telehealth or repeated-monitoring settings where spontaneous speech may already be available. This

DFU-GCNet: a global context-enhanced inception network for robust and interpretable diabetic foot ulcer classification

Introduction: Diabetic foot ulcers (DFUs) are severe complications that cause frequent lower extremity amputations. Timely diagnosis is crucial for effective clinical management. Although deep learning approaches

Resources consumption and environmental impacts of the DYNAMIC digital health intervention aimed at improving quality of care for sick children in Tanzania: a life cycle assessment

Background: Health systems contribute substantially to the overshoot of planetary boundaries, but the effect of their rapid digitalization is not well known. DYNAMIC is a

Blinded two-phase evaluation of large language models in complex cardiac surgery: task-specific performance and human-AI collaboration

June 1, 2026

Background: Large language models (LLMs) have demonstrated strong performance on standardized medical benchmarks. However, their potential in complex surgical decision-making is largely uncharacterized. Critically, the extent to which clinicians can effectively recognize and integrate model-generated reasoning during human–LLM collaboration remains an unaddressed question. To address these gaps, we developed a two-phase evaluation framework to simultaneously assess LLM performance and human–LLM collaboration in cardiac surgery.

Methods: A panel of senior cardiac surgeons independently developed 15 high-fidelity cardiac surgery scenarios, each paired with a clinically relevant open-ended reasoning task, expert-curated reference answers, and a 10-dimensional weighted evaluation framework. Five representative LLMs (O1, O3-mini-high, DeepSeek-R1, GPT-4, and Llama3-OpenBioLLM-70B) were prompted using a multi-agent strategy. A separate group of senior surgeons conducted a blinded two-phase evaluation to assess model performance and evaluator judgment shifts: in the first round, they rated LLMs independently; in the second, they were shown the reference answers and invited to revise their ratings, with changes being optional.

Results: LLM performance varied across scenarios, but relative rankings remained stable. Median normalized scores were highest for O1 (0.896), followed by O3-mini-high (0.854), DeepSeek-R1 (0.792), GPT-4 (0.667), and Llama3-OpenBioLLM-70B (0.521). Across evaluation dimensions, scenario comprehension scored highest (0.920), while patient safety (0.507), hallucination avoidance (0.549), and clinical efficiency (0.597) were lowest across models. Second-round normalized scores declined for four LLMs, with 7.57% of ratings revised from affirmative to negative and only 2.59% from negative to affirmative. Among the five highest-weighted evaluation dimensions, 10.16% of second-round ratings were revised from affirmative to negative.

Conclusions: Reasoning-optimized LLMs outperformed all other models. However, all models exhibited clinical limitations, including poor performance in core evaluation dimensions and in scenarios requiring complex, longitudinal reasoning. Overacceptance was the dominant collaboration imbalance: clinicians tended to accept model reasoning that appeared clinically sound yet was incorrect or potentially harmful. These findings suggest that these LLMs are not yet ready for safe use in complex surgical settings due to both performance limitations and human–LLM collaboration imbalance.

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844