Promotion and preservation of mobility and autonomy in old age through smart rollators—a qualitative study

Background: Diseases and health limitations associated with ageing often result in loss of mobility and reduced social participation. The ongoing demographic shift towards an increasingly ageing

Trustworthy intelligent rooms: integrating blockchain, federated learning, and data-centric AI for healthcare 4.0

Introduction: Intelligent room systems are experiencing a surge in demand within the Healthcare 4.0 ecosystem. The integration of Federated Learning (FL) and Data-Centric AI has led

AI-driven mental health decision support linked to clinician resilience and preparedness

Objectives: Mental health services are facing unprecedented demand, placing significant pressure on clinicians to conduct timely and effective patient assessments. Rising staff turnover and burnout threatens

Big data integration for enhanced epidemiological research: insights and directions from NHLBI’s workshop

The landscape of epidemiological research is experiencing a technological transformation, driven by the rapid expansion of big data and advancements in artificial intelligence (AI) and

Construction of patient trajectories to model clinical trial outcomes: application to myasthenia gravis

Introduction: Accurate prediction of patient outcomes in clinical trials is crucial for the timely assessment of treatment efficacy. This study proposes a novel approach to predict

Assessment of frontier Large Language Models in sleep medicine

April 29, 2026

Study objectives: To evaluate and compare the performance of two proprietary frontier large language models (LLMs), ChatGPT-5 and Grok-4, on diagnostic reasoning and foundational knowledge tasks within the specialty domain of sleep medicine.

Methods: The models were evaluated on two tasks: case-based reasoning using 79 clinical vignettes from the AASM Case Book of Sleep Medicine and knowledge assessment using 897 multiple-choice questions (MCQs) from board review materials. For vignettes, final diagnosis was scored by concept-level exact match, and differential diagnosis (DDx) was scored on a fixed top-5 output using concept-level matching with synonym normalization to compute precision, recall, and F1-score. MCQ performance was the proportion correct. Inter-model performance was compared using the Mann–Whitney U test.

Results: Both models achieved high accuracy for final diagnosis (92.4% for both; 95% CI 86.4, 98.4) and MCQs (ChatGPT-5: 93.0%; Grok-4: 92.8%). However, performance on generating a comprehensive differential diagnosis was suboptimal, with modest F1-scores for both ChatGPT-5 (0.55 ± 0.20) and Grok-4 (0.59 ± 0.20). There were no statistically significant differences in performance between the two models across any metric (p > 0.05).

Conclusions: Frontier LLMs demonstrated high accuracy in sleep medicine tasks requiring knowledge recall and direct pattern recognition but showed more limited performance in complex clinical reasoning tasks such as generating a comprehensive differential diagnosis. These findings suggest that current general-purpose models may be more reliable for focused knowledge support than for broad hypothesis generation. Future studies should evaluate whether domain-adapted models or clinician-in-the-loop workflows can improve real-world diagnostic usefulness and safety.
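To make the differential diagnosis scoring in the Methods more concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed for one vignette from a fixed top-5 list after synonym normalization. It is an illustration under stated assumptions, not the study's actual pipeline: the `SYNONYMS` table, the `normalize` and `ddx_scores` helpers, and the example diagnoses are all hypothetical, and the abstract does not describe how concepts were mapped or how duplicates were handled.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical synonym table (assumption): the study's real normalization
# rules are not given in the abstract.
SYNONYMS: Dict[str, str] = {
    "osa": "obstructive sleep apnea",
    "obstructive sleep apnoea": "obstructive sleep apnea",
    "rls": "restless legs syndrome",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, and map known synonyms to one concept."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def ddx_scores(
    predicted: Iterable[str], reference: Iterable[str], k: int = 5
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the top-k predicted diagnoses.

    Assumption: entries that normalize to the same concept count once.
    """
    pred: Set[str] = {normalize(t) for t in list(predicted)[:k]}
    ref: Set[str] = {normalize(t) for t in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Invented example vignette: 2 of 5 predictions match 2 of 3 reference concepts.
model_ddx = ["OSA", "insomnia", "RLS", "narcolepsy", "central sleep apnea"]
reference_ddx = [
    "obstructive sleep apnea",
    "restless legs syndrome",
    "periodic limb movement disorder",
]
p, r, f1 = ddx_scores(model_ddx, reference_ddx)
print(round(p, 2), round(r, 2), round(f1, 2))  # prints: 0.4 0.67 0.5
```

Averaging the per-vignette F1 values across all cases would give a summary of the kind reported in the Results (mean ± standard deviation), though the exact aggregation used by the authors is not stated.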


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844