Advancing the adoption of oncology decision support tools in Europe: insights from CAN.HEAL

Effective cancer care increasingly depends on digital decision support tools (DSTs) to interpret complex clinical, molecular, and genomic data and guide personalised treatment decisions. However,

Early Type 2 diabetes risk prediction using explainable machine learning in a two-stage approach

Background: Diabetes is a chronic disease characterized by elevated blood glucose levels. Without early detection and proper management, it can lead to serious complications and increase

Practical templates for digital health ethics applications in Sweden: lessons from a sensor-based monitoring study

Obtaining ethical approval for digital health research involving vulnerable populations presents significant challenges for researchers, particularly when navigating complex regulatory frameworks like Sweden’s ethical review

Enhanced meta ensemble stacking approach with XGBoost and optuna based detection of Parkinson’s disease

Parkinson’s disease (PD), a progressive neurological disorder affecting motor function, has been significantly rising in prevalence in recent years. Current diagnostic methods, relying on clinical

Editorial: Advancing digital mental health for youth

Comparative performance evaluation of large language models in answering esophageal cancer-related questions: a multi-model assessment study

October 15, 2025

Background: Esophageal cancer has high incidence and mortality rates, leading to increased public demand for accurate information. However, the reliability of online medical information is often questionable. This study systematically compared the accuracy, completeness, and comprehensibility of mainstream large language models (LLMs) in answering esophageal cancer-related questions.

Methods: In total, 65 questions covering fundamental knowledge, preoperative preparation, surgical treatment, and postoperative management were selected. Each model, namely, ChatGPT 5, Claude Sonnet 4.0, DeepSeek-R1, Gemini 2.5 Pro, and Grok-4, was queried independently using standardized prompts. Five senior clinical experts, including three thoracic surgeons, one radiologist, and one medical oncologist, evaluated the responses using a five-point Likert scale. A retesting mechanism was applied for the low-scoring responses, and intraclass correlation coefficients were used to assess the rating consistency. The statistical analyses were conducted using the Friedman test, the Wilcoxon signed-rank test, and the Bonferroni correction.

Results: All the models performed well, with average scores exceeding 4.0. However, the following significant differences emerged: Gemini excelled in accuracy, while ChatGPT led in completeness, particularly in surgical and postoperative contexts. Minor differences appeared in fundamental knowledge, but notable disparities were found in complex areas. Retesting showed improvements in overall quality, yet some responses showed decreased completeness and relevance.

Conclusion: Large language models have considerable potential in answering questions about esophageal cancer, with significant differences in completeness. ChatGPT is more comprehensive in complex scenarios, while Gemini excels in accuracy. This study offers guidance for selecting artificial intelligence tools in clinical settings, advocating for a tiered application strategy tailored to specific scenarios and highlighting the importance of user education to understand the limitations and applicability of LLMs.
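For readers unfamiliar with the statistical workflow named in the abstract, the following is a minimal Python sketch of two of its pieces: the Friedman statistic across models and a Bonferroni-adjusted significance threshold for pairwise follow-up tests. It is an illustration under stated assumptions, not the study's actual analysis. The 0.05 significance level, the choice to compare all model pairs, and the helper names are hypothetical, and the Friedman statistic is shown without the tie correction that statistical packages normally apply.

```python
from itertools import combinations

# Model names as listed in the abstract.
MODELS = ["ChatGPT 5", "Claude Sonnet 4.0", "DeepSeek-R1", "Gemini 2.5 Pro", "Grok-4"]


def average_ranks(row):
    """Rank one question's scores across models (1 = lowest), averaging ties."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the one at `start`.
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square for rows = questions, columns = models.

    Simplified: no tie correction is applied (an assumption of this sketch).
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison threshold after Bonferroni correction."""
    return alpha / n_comparisons


# Assumption: every pair of models is compared in the follow-up step.
pairs = list(combinations(MODELS, 2))
# Assumption: a conventional 0.05 family-wise significance level.
adjusted_alpha = bonferroni_alpha(0.05, len(pairs))
```

With five models there are ten pairwise comparisons, so under these assumptions each pairwise Wilcoxon signed-rank test would be judged against a threshold of 0.005 rather than 0.05. In practice, library routines such as SciPy's `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` would typically replace the hand-written statistic.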

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844