Background: “I’m not a doctor, but…” is a typical response when considerate laypeople are asked for health advice. Seeking medical advice, however, has increasingly shifted to digital settings, where the expertise of the other party is less transparent than in face-to-face interactions. Recently, large language models (LLMs) have emerged as easily accessible tools, offering a novel way to formulate medical questions and receive seemingly qualified advice. Given the sensitive nature of health-related queries and the lack of professional supervision, incorrect advice can pose serious health risks. Including explicit disclaimers and precise referrals in LLM responses to medical queries is therefore crucial. However, little is known about how LLMs adapt these safety measures to queries of different urgency levels.

Objective: This study evaluates disclaimer and referral patterns in LLM responses to authentic medical queries of different urgency levels using a systematic evaluation framework.

Methods: This prospective, multimodel evaluation study generated and analyzed 908 responses from 4 popular LLMs (GPT-4o, Claude Sonnet-4, Grok-3, and DeepSeek-V3) to 227 authentic patient queries from a public dataset. Two human raters classified all 227 patient queries using a 3-level urgency scale. LLM responses were evaluated using a 5-point ordinal classification system for disclaimer and referral advice, ranging from “no disclaimer” to “urgent advice to consult a medical professional.” GPT-4o served as the primary rater model for this task after a subset validation against human expert annotations. Statistical analyses included Jonckheere-Terpstra tests to examine the relationship between case urgency and disclaimer ratings and Kruskal-Wallis tests for intermodel comparisons.

Results: The 227 patient queries comprised 77 (34%) low-urgency, 110 (48%) intermediate-urgency, and 40 (18%) high-urgency cases. All 4 LLMs demonstrated statistically significant ordered trends (all P<.001), with higher-urgency queries receiving more explicit referral advice. Disclaimer and referral advice clustered toward higher categories across all models, with 97% (881/908) of responses indicating that a medical professional should be consulted. Sonnet-4, Grok-3, and GPT-4o demonstrated a conservative approach, with 89%, 89%, and 88% of their responses, respectively, being either explicit or urgent referrals. In contrast, DeepSeek-V3 showed a broader distribution, with 74% of responses falling into these categories. Interrater reliability between GPT-4o and human raters reached moderate to substantial agreement, with weighted Cohen κ values between 0.415 and 0.707.

Conclusions: Current LLMs exhibit urgency-responsive safety mechanisms when providing medical advice. All evaluated models adaptively incorporate more explicit disclaimers and urgent referrals for higher-urgency queries. However, variability between LLMs highlights the need for standardized safety measures and appropriate regulatory frameworks. Although these findings indicate progress on safety concerns, the public availability of LLMs requires careful consideration to ensure consistent protection against patient harm while preserving the benefits of low-threshold access to health information.
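The Methods above describe an ordinal analysis pipeline: an ordered-trend test across urgency levels (Jonckheere-Terpstra), a between-model comparison of rating distributions (Kruskal-Wallis), and rater agreement via weighted Cohen κ. The abstract does not name the statistical software used, so the following is a minimal sketch of how such an analysis could be reproduced in Python. The jonckheere_terpstra helper, the example rating vectors, and the quadratic κ weighting are illustrative assumptions, not the authors' implementation, which may additionally apply tie corrections or exact tests.

```python
import numpy as np
from scipy.stats import kruskal, norm
from sklearn.metrics import cohen_kappa_score


def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra test for an ordered trend across groups.

    `groups` is a list of 1-D rating arrays ordered by hypothesized
    urgency (low -> intermediate -> high). Returns the JT statistic,
    the z score, and a two-sided p-value from the normal approximation.
    Note: this simple version omits the tie-corrected variance that a
    full implementation would use for heavily tied ordinal data.
    """
    jt = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            a = np.asarray(groups[i], dtype=float)
            b = np.asarray(groups[j], dtype=float)
            diff = b[None, :] - a[:, None]
            # Mann-Whitney-style count: pairs where the higher-urgency
            # group received the higher disclaimer rating; ties count 0.5.
            jt += np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    n = np.array([len(g) for g in groups])
    N = n.sum()
    mean = (N**2 - np.sum(n**2)) / 4.0
    var = (N**2 * (2 * N + 3) - np.sum(n**2 * (2 * n + 3))) / 72.0
    z = (jt - mean) / np.sqrt(var)
    return jt, z, 2 * norm.sf(abs(z))


# Illustrative (made-up) 1-5 disclaimer ratings for one model,
# grouped by the human-assigned urgency of the query.
low = [3, 3, 4, 4, 5, 4]
intermediate = [4, 4, 5, 4, 5, 5]
high = [5, 5, 4, 5, 5, 5]

jt, z, p_trend = jonckheere_terpstra([low, intermediate, high])

# Kruskal-Wallis comparison of rating distributions across models
# (each list would hold one model's 227 ratings in the real analysis).
h, p_models = kruskal([4, 5, 4, 5], [5, 5, 5, 4], [3, 4, 4, 5], [4, 4, 5, 5])

# Weighted Cohen kappa between the GPT-4o rater and a human rater
# (quadratic weighting is an assumption; the paper reports 0.415-0.707).
kappa = cohen_kappa_score([5, 4, 3, 5, 4], [5, 4, 4, 5, 4], weights="quadratic")

print(f"JT z={z:.2f}, p={p_trend:.4f}; "
      f"KW H={h:.2f}, p={p_models:.4f}; kappa={kappa:.3f}")
```

A one-sided version of the trend test (positive z only) would match the directional hypothesis that higher urgency yields stronger referrals; the two-sided p-value above is the more conservative default.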
Unlocking electronic health records: a hybrid graph RAG approach to safe clinical AI for patient QA
Introduction

Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While



