Reasoning large language models are increasingly considered for healthcare-related artificial intelligence applications, but their practical value depends not only on diagnostic accuracy, but also on responsiveness and operational reliability. In this study, we benchmarked six model settings on 1,000 questions from the MedQA dataset: DeepSeek-R1, its distilled 8-billion-parameter local variant DeepSeek-R1:8b, ChatGPT o3-mini-high, and their knowledge-base–augmented counterparts. We evaluated performance across three dimensions: diagnostic accuracy, response latency, and first-attempt connection reliability. DeepSeek-R1 achieved the highest accuracy (89.5%, 95% CI: 87.4%–91.2%) but showed substantially longer response times (median 26.54 s) and higher connection failure rates (4.6%). ChatGPT o3-mini-high responded faster (median 10.05 s) and showed the most favorable tail-latency profile, but its accuracy (78.2%, 95% CI: 75.5%–80.7%) was lower than that of DeepSeek-R1. The locally deployed DeepSeek-R1:8b demonstrated markedly stronger connection reliability (failure rate 0.2%, 95% CI: 0.0%–0.5%) but substantially reduced accuracy (55.0%, 95% CI: 51.9%–58.5%). Knowledge-base augmentation did not consistently improve performance; for DeepSeek-R1, it significantly reduced accuracy by 4.36% (p=0.0002), while no significant benefit was observed for the other models. These findings show that reasoning model performance in medical question answering is best understood as a trade-off among accuracy, latency, connection reliability, and deployment mode, and that retrieval augmentation is not universally beneficial. More broadly, this study provides deployment-relevant benchmarking evidence for evaluating reasoning models in healthcare-related settings, while also indicating the need for richer knowledge resources and more realistic task environments before such systems can be meaningfully assessed for real-world clinical use.
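The 95% confidence intervals reported for accuracy above are consistent with a standard interval for a binomial proportion over the 1,000-question benchmark. The abstract does not state which method was used, so the following is only an illustrative sketch using the Wilson score interval; the helper name `wilson_ci` is hypothetical, not from the study.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score ~95% confidence interval for a binomial proportion.

    successes: number of correctly answered questions
    n:         total number of questions
    z:         normal quantile (1.96 for a 95% interval)
    """
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

# Example: 895 correct answers out of 1,000 MedQA questions
# (DeepSeek-R1's reported 89.5% accuracy).
lo, hi = wilson_ci(895, 1000)
print(f"{lo:.1%} - {hi:.1%}")
```

Running this for 895/1000 yields an interval close to the reported 87.4–91.2% range; small differences in the last digit can arise if the study used a different interval method (e.g., Clopper–Pearson) or rounding convention.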
The MediVoice implementation journey: ambient artificial intelligence for clinical documentation
Healthcare systems are increasingly turning to ambient Artificial Intelligence (AI) scribes to reduce documentation burden and lighten clinicians’ cognitive load. In this brief research report,

