Development and interpretable machine learning models for classification of pancreatic pseudocyst risk in acute pancreatitis

Introduction: Pancreatic pseudocysts (PPC) are a late local complication of acute pancreatitis (AP). Persistent PPC carry a high risk of severe outcomes. Existing models, which are

Implementing AI innovation in radiology departments in the English NHS: a qualitative study on the experiences of professionals, patient groups and innovators

Introduction: Digital solutions and Artificial Intelligence (AI) innovations are often presented as the answer to many challenges faced by healthcare systems around the world. The UK

Development and Evaluation of a Hallucination Awareness Scale for Healthcare Professionals and its impact on diagnostic confidence

Generative artificial intelligence (Gen AI) has gained immense significance in recent years, particularly in the field of healthcare. Despite its significant role in streamlining healthcare-related

Planning and delivering co-creation workshops: practical lessons from digital health device design

Co-creation methods are increasingly recognised as essential in digital health and care, yet engineers and physical scientists new to the field often find the literature

Promises and challenges of applying large language models in the healthcare domain

Large language models are rapidly moving from theoretical concepts to active clinical pilots. Current approaches diverge between general-purpose models, which adapt to healthcare via prompt

Evaluating large language models for automated TNM staging from PET-CT reports: a multi-cancer comparative study

March 11, 2026

Purpose: To evaluate three large language models (LLMs), ChatGPT 5, ChatGPT 4o, and ChatGPT 3.5, in automating TNM staging from PET-CT reports across six cancer types, and to assess their clinical utility compared with junior radiologists.
Materials and methods: PET-CT reports from 552 treatment-naive patients at two institutions with confirmed primary malignancies (lung, breast, liver, pancreatic, renal, and prostate cancer) were analyzed. The three ChatGPT-series LLMs and five junior radiologists independently performed TNM staging. Reference standards were established by two senior radiologists according to the 8th edition of the American Joint Committee on Cancer (AJCC) staging system. Performance was evaluated using accuracy rates. Intra-model agreement was assessed by running each model three times per report with identical prompts, and inter-model agreement was evaluated using Cohen’s κ coefficients.
Results: ChatGPT 5 achieved the highest overall accuracy (82.1%, 453/552), followed by ChatGPT 4o (74.3%, 410/552), both significantly outperforming ChatGPT 3.5 (59.6%, 329/552). Junior radiologists reached 77.0% (425/552; p = 0.041 for ChatGPT 5 vs. junior radiologists). Accuracy varied by cancer type, with the highest performance in lung cancer staging (88.5%) and the lowest in pancreatic cancer (69.2%). Across TNM categories, all models performed best in T staging, followed by N staging, with M staging remaining the most challenging. ChatGPT 5 showed near-perfect intra-model agreement (κ = 0.96), while inter-model agreement ranged from moderate between ChatGPT 3.5 and 4o (κ = 0.58) to substantial between ChatGPT 5 and 4o (κ = 0.78). ChatGPT 5 processed cases markedly faster than junior radiologists (8.3 ± 3.2 vs. 92.5 ± 21.7 s per case; p < 0.001).
Conclusion: Among the three LLMs, ChatGPT 5 demonstrated the highest accuracy, stability, and efficiency in automated TNM staging from PET-CT reports, achieving performance comparable to or slightly exceeding that of junior radiologists. Its advantages in T staging and lung cancer evaluation highlight its clinical utility as a potential decision-support tool.
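
For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how an accuracy rate and Cohen's κ are computed. It is not the study's analysis code. The counts in `reported_counts` are taken from the abstract; the function names and the two short stage lists (`rater_a`, `rater_b`) are hypothetical and exist only to illustrate the formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance.

```python
from collections import Counter


def accuracy(correct: int, total: int) -> float:
    """Fraction of reports staged correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Correct/total counts as reported in the abstract.
reported_counts = {
    "ChatGPT 5": (453, 552),
    "ChatGPT 4o": (410, 552),
    "ChatGPT 3.5": (329, 552),
    "Junior radiologists": (425, 552),
}

for name, (correct, total) in reported_counts.items():
    print(f"{name}: {accuracy(correct, total):.1%}")

# Hypothetical stage labels for six reports, for illustration only.
rater_a = ["I", "II", "II", "III", "IV", "I"]
rater_b = ["I", "II", "III", "III", "IV", "II"]
print(f"kappa (toy data): {cohen_kappa(rater_a, rater_b):.3f}")
```

Because κ subtracts chance agreement, it is lower than raw percent agreement whenever the raters disagree at all, which is why agreement between models is reported as κ rather than as a simple match rate.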


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844