Effectiveness of AI-Assisted Patient Health Education Using Voice Cloning and ChatGPT: Prospective Randomized Controlled Trial

Background: Traditional patient education often lacks personalization and engagement, potentially limiting knowledge acquisition and treatment adherence. Advances in artificial intelligence (AI), including voice cloning technology

Guide on Selection of Optimal Motivational Themes for Use in a Clinical Trial Recruiting Black US Adults: Survey Study

Background: Black adults in the United States face significant cardiovascular health disparities, which are likely exacerbated by the underrepresentation of Black adults in cardiovascular clinical

The Right to Understand in Health Care AI

Translating Telehealth Communication Research Into Patient-Centered, Implementable Practice

Understanding both patient and clinician perspectives on communication challenges in virtual primary care consultations is important to ensure safe and effective care. This commentary reviews

Telemedicine Adoption for Managing Chronic and Rare Diseases in Indonesia During and Beyond the COVID-19 Era: Qualitative Study

Background: Telemedicine has emerged as a valuable tool for improving health care delivery, especially in low-resource and geographically isolated regions. In Indonesia, the COVID-19 pandemic

Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

March 17, 2026

arXiv:2603.11413v2
Abstract: Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors’ released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0–24% with forced choice but 100% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors’ exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. The headline under-triage rate is highly contingent on evaluation format and should not be interpreted as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844