Performance of large language models and prompt engineering strategies for data extraction in systematic reviews

Background
Systematic reviews depend on manual data extraction and synthesis, which are time-consuming and prone to human error. Although large language models (LLMs) have the potential to automate parts of this process, their accuracy, reproducibility, and efficiency across different models and prompt strategies remain insufficiently characterized.

Methods
This study evaluated the performance of three LLMs (ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.5 Pro) for data extraction from trials addressing five clinical questions (CQs) in the Japanese Clinical Practice Guidelines for the Management of Sepsis and Septic Shock 2024 (J-SSCG 2024). Using portable document format (PDF) files of eligible studies, the LLMs extracted predefined background characteristics and clinical outcomes. Outputs generated using an original prompt were compared with those produced using chain-of-thought (CoT) and self-reflection (SR) prompt strategies. Two independent reviewers assessed accuracy against a reference standard derived from manual extraction by the guideline members. Inter-session consistency across three sessions and processing time were also evaluated.

Results
For background data extraction, mean no-error proportions ranged from 81.6% (ChatGPT-4o) to 92.4% (Claude 3 Sonnet) across models. For outcome data extraction, mean no-error proportions ranged from 27.8% (Gemini 1.5 Pro) to 80.7% (Claude 3 Sonnet). Missing or incorrect values accounted for most extraction errors, whereas fabricated outputs were relatively uncommon. Prompt engineering strategies produced only modest changes in extraction accuracy across models. Inter-session consistency ranged from 76.3% (ChatGPT-4o) to 91.3% (Gemini 1.5 Pro) for background data extraction and from 44.8% (ChatGPT-4o) to 65.6% (Claude 3 Sonnet) for outcome data extraction. With standard prompts, mean processing times ranged from 29.2 to 39.7 s per article for background data extraction and from 19.3 to 46.3 s for outcome data extraction. When SR prompts were used, processing times increased to 59.0 to 97.7 s for background data extraction and to 52.7 to 107.1 s for outcome data extraction.

Conclusions
LLMs can reliably support background data extraction in systematic reviews. However, outcome data extraction remains challenging, underscoring the continued need for human oversight. Extraction performance varied across models and prompt engineering strategies.

Clinical Trial Registration
The study was registered in the University Hospital Medical Information Network (UMIN) clinical trials registry, identifier UMIN000054461.
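To make the comparison described in the Methods concrete, the sketch below shows one way a standard extraction prompt could be contrasted with a self-reflection variant for a single article. This is a minimal illustration, not the study's actual pipeline: the prompt wording, extraction fields, and use of the OpenAI Python SDK with GPT-4o are assumptions introduced here for demonstration only.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical extraction instruction; the study's real prompts and field
# definitions are not reproduced here.
STANDARD_PROMPT = (
    "Extract the following fields from the trial report below and answer "
    "in JSON: sample size, mean age, intervention, 28-day mortality.\n\n"
    "Trial report:\n{article_text}"
)

# A self-reflection (SR) variant asks the model to re-check its first
# answer against the source text before giving the final output.
SELF_REFLECTION_SUFFIX = (
    "\n\nBefore giving your final answer, re-read the trial report, verify "
    "each extracted value against the text, and correct any mistakes."
)


def extract(article_text: str, self_reflection: bool = False) -> str:
    """Run one extraction pass, optionally with the SR prompt strategy."""
    prompt = STANDARD_PROMPT.format(article_text=article_text)
    if self_reflection:
        prompt += SELF_REFLECTION_SUFFIX
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the three models evaluated in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # favor reproducibility across repeated sessions
    )
    return response.choices[0].message.content


# Usage: run the same article text through both prompt strategies and
# compare the outputs, analogous to the per-article comparisons reported.
# print(extract(article_text, self_reflection=False))
# print(extract(article_text, self_reflection=True))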
