Reasoning large language models are increasingly considered for healthcare-related artificial intelligence applications, but their practical value depends not only on diagnostic accuracy, but also on responsiveness and operational reliability. In this study, we benchmarked six model settings on 1,000 questions from the MedQA dataset: DeepSeek-R1, its distilled 8-billion-parameter local variant DeepSeek-R1:8b, ChatGPT o3-mini-high, and their knowledge-base–augmented counterparts. We evaluated performance across three dimensions: diagnostic accuracy, response latency, and first-attempt connection reliability. DeepSeek-R1 achieved the highest accuracy (89.5%, 95% CI: 87.4%–91.2%) but showed substantially longer response times (median 26.54 s) and higher connection failure rates (4.6%). ChatGPT o3-mini-high responded faster (median 10.05 s) and showed the most favorable tail-latency profile, but its accuracy (78.2%, 95% CI: 75.5%–80.7%) was lower than that of DeepSeek-R1. The locally deployed DeepSeek-R1:8b demonstrated markedly stronger connection reliability (failure rate 0.2%, 95% CI: 0.0%–0.5%) but substantially reduced accuracy (55.0%, 95% CI: 51.9%–58.5%). Knowledge-base augmentation did not consistently improve performance; for DeepSeek-R1, it significantly reduced accuracy by 4.36% (p=0.0002), while no significant benefit was observed for the other models. These findings show that reasoning model performance in medical question answering is best understood as a trade-off among accuracy, latency, connection reliability, and deployment mode, and that retrieval augmentation is not universally beneficial. More broadly, this study provides deployment-relevant benchmarking evidence for evaluating reasoning models in healthcare-related settings, while also indicating the need for richer knowledge resources and more realistic task environments before such systems can be meaningfully assessed for real-world clinical use.
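The 95% confidence intervals reported for accuracy above are consistent with a standard interval for a binomial proportion over the 1,000-question benchmark. The abstract does not state which method was used, so the following is only an illustrative sketch using the Wilson score interval; the helper name `wilson_ci` is hypothetical, not from the study.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score ~95% confidence interval for a binomial proportion.

    successes: number of correctly answered questions
    n:         total number of questions
    z:         normal quantile (1.96 for a 95% interval)
    """
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

# Example: 895 correct answers out of 1,000 MedQA questions
# (DeepSeek-R1's reported 89.5% accuracy).
lo, hi = wilson_ci(895, 1000)
print(f"{lo:.1%} - {hi:.1%}")
```

Running this for 895/1000 yields an interval close to the reported 87.4–91.2% range; small differences in the last digit can arise if the study used a different interval method (e.g., Clopper–Pearson) or rounding convention.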
The MediVoice implementation journey: ambient artificial intelligence for clinical documentation
Healthcare systems are increasingly turning to ambient Artificial Intelligence (AI) scribes to reduce documentation burden and lighten clinicians’ cognitive load. In this brief research report,

