Mapping Practice-Based Signals of Generative AI in Psychiatric Care: Qualitative Study of Korean Psychiatrists’ Experiences, Interpretations, and Implementation Priorities

Background: Generative artificial intelligence (GenAI) has increasingly entered psychiatric practice through patient-facing chatbots, self-help tools, and clinician-facing workflow support. Although prior research has examined clinicians’

Usage and Exposure to Content of the NHS Healthy Living Program for People With Type 2 Diabetes: Retrospective Observational Cohort Study

Background: Diabetes self-management education and support (DSMES) programs can improve health outcomes, but engagement is often low. “Healthy Living” is a web-based self-management program for

User-Centered Diabetes Self-Management App (DiabAid Nexus) in Sub-Saharan Africa: Development and Usability Study

Background: Diabetes is a significant global public health concern, with disproportionately high prevalence and poor access to care in low-resource settings, notably in Sub-Saharan Africa.

Opportunities and Concerns of Gamified, Extended Reality for Home-Based Motor Rehabilitation for Children With Brain Injury: Qualitative Case Study on Design Elements Related to the Engagement and Fatigue Perspectives

Background: Acquired brain injuries are injuries that occur after birth and are a leading cause of long-term disability and death in children and young adults.

Policymakers and Researchers Zero In On the Impact of AI Toys


Confidence Measurement Metrics in Multimodal Large Language Models for Ultrasound-Based Radiology Cases: Comparative Evaluation Study of Self-Reported, Consistency-Based, and Hybrid Methods

June 2, 2026

Background: Large language models (LLMs) require specialized methodologies to quantify model confidence for safe deployment in health care systems; however, there is a lack of established methods for confidence assessment.

Objective: This study aimed to evaluate confidence metrics for multimodal LLMs interpreting ultrasound-based radiology cases and to compare self-reported, consistency-based, and hybrid methods.

Methods: From a total of 330 quizzes on the Korean Society of Ultrasound in Medicine digital platform, we selected 94 multiple-choice cases. Four multimodal LLMs were evaluated: 3 reasoning models (GPT-5, Claude-4.5-Sonnet, and Gemini-3-Pro) and 1 general model (GPT-4o). Temperature was fixed at 1.0. Multiple confidence metrics were assessed: (1) self-reported metrics generated by LLMs using prompts that elicited direct confidence percentages with answers, including first self-reported confidence and mean self-reported confidence; (2) consistency-based metrics derived from 20 repeated outputs per case, including relative entropy calculated as 1 − H/log k (H=Shannon entropy, k=number of answer choices) and majority-vote percentage; and (3) a Top Weighted Score combining response frequency with self-reported confidence. Receiver operating characteristic analysis for discrimination and Spearman correlation between accuracy and each confidence metric were conducted. Additionally, model calibration was assessed using expected calibration error and Brier score. Processing time and token consumption (input, output, and total) were recorded for each application programming interface call to evaluate resource use across models.

Results: Diagnostic accuracy varied across models, with Gemini-3-Pro achieving the highest accuracy (70/94, 74.47%), surpassing the median human accuracy (59%, IQR 40.3%-75%). Top Weighted Score, a hybrid metric combining response frequency and self-reported confidence, was the only metric achieving statistically significant correlations across all 4 models: Gemini-3-Pro (ρ=0.52), GPT-5 (ρ=0.43), Claude-4.5-Sonnet (ρ=0.30), and GPT-4o (ρ=0.22). Receiver operating characteristic analysis revealed that Top Weighted Score demonstrated the highest discriminative ability, with area under the curve values of 0.826 (95% CI 0.731-0.920) for Gemini-3-Pro and 0.767 (95% CI 0.668-0.866) for GPT-5. Top Weighted Score was the only metric achieving statistical significance in GPT-4o. Calibration analysis showed that Top Weighted Score achieved the lowest expected calibration error in GPT-5 (0.098) and Claude-4.5-Sonnet (0.192), while Gemini-3-Pro showed comparable calibration between relative entropy (0.119) and Top Weighted Score (0.122). Resource use analysis demonstrated that the reasoning models required substantially longer processing times and higher token consumption than the general model.

Conclusions: In multimodal LLMs applied to ultrasound-based radiology cases, hybrid methods (Top Weighted Score) demonstrated significant associations across all evaluated models and appear to serve as more reliable indicators of diagnostic confidence than self-reported or consistency-based metrics alone, although the strength of these associations varied across models, and external validation is warranted before broader clinical application. These findings support integrative confidence estimation approaches that incorporate response consistency while highlighting the need for resource-efficient sampling strategies to enable practical clinical deployment.
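
As a rough illustration of the consistency-based metrics named in the abstract, the following Python sketch computes relative entropy (1 − H/log k) and majority-vote percentage from repeated answers to a single case. The abstract does not give the formula for the Top Weighted Score, so `top_weighted_score` below is only one hypothetical way to combine response frequency with self-reported confidence and should not be read as the study's definition. All function names and the sample data are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def relative_entropy(answers, k):
    """Return 1 - H / log(k), where H is the Shannon entropy of the
    observed answer distribution and k is the number of answer choices."""
    n = len(answers)
    counts = Counter(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - h / math.log(k)


def majority_vote_percentage(answers):
    """Return the share (0-100) of runs that gave the most frequent answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 100.0 * top_count / len(answers)


def top_weighted_score(answers, confidences):
    """HYPOTHETICAL hybrid score (the abstract does not specify the formula):
    sum the self-reported confidence (0-100) of each answer over all runs,
    divide by the number of runs, and keep the best-scoring answer's value."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals.values()) / len(answers)


# Made-up example: 20 repeated outputs for one 4-choice case.
answers = ["A"] * 15 + ["B"] * 5
confidences = [80.0] * 15 + [60.0] * 5

print(relative_entropy(answers, k=4))
print(majority_vote_percentage(answers))
print(top_weighted_score(answers, confidences))
```

With this construction, a case answered identically in every run yields a relative entropy of 1, and answers spread evenly over all k choices yield 0, which matches the 1 − H/log k definition given in the Methods.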


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844