Background: Artificial intelligence technologies are being actively introduced into clinical practice. Among the most promising solutions are AI assistants based on large language models (LLMs). Determining the feasibility of integrating such applications into clinical practice requires independent performance assessments. This study assessed the accuracy of several multimodal LLMs in detecting pulmonary nodules on chest radiographs (CXR).

Methods: The study included 9 models: Llama 3.2 Vision 90B, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Gemini 2.0 Pro Experimental, Perplexity, CXR-LLaVA, XrayGPT, BiomedCLIP, and MedRAX. Each model determined the presence or absence of pulmonary nodules in a dataset of 100 CXR, 50 of which contained pulmonary nodules. ROC curves were constructed and diagnostic accuracy metrics were calculated. McNemar’s test was used for pairwise accuracy comparisons.

Results: The best results were achieved by the MedRAX framework and the BiomedCLIP vision-language model, with an accuracy of 0.711 (95% CI 0.613–0.808). Among proprietary single-model LLMs, Claude 3.7 Sonnet demonstrated the best performance, with an accuracy of 0.651 (95% CI 0.548–0.753). Llama 3.2 Vision 90B, Claude 3.5 Sonnet, and Gemini 2.0 Pro Experimental demonstrated matching accuracy values of 0.602 (95% CI 0.497–0.708).

Conclusion: The MedRAX framework and the BiomedCLIP vision-language model showed the highest accuracy values. No statistically significant difference was observed between proprietary and open-source models, which may indicate potential for improving accuracy through refinement of open-source LLM-based models. Overall, the accuracy of the evaluated models was insufficient for implementation in current clinical practice. These results should be regarded as exploratory given the small dataset size, the single-centre design, the different prompting strategies used for foundation and domain-adapted models, and the use of PNG images instead of DICOM.
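The statistical steps named in the Methods (an accuracy estimate with a 95% confidence interval, and McNemar’s test for a paired comparison of two models on the same images) can be illustrated with a minimal Python sketch. The abstract does not state which interval method or which variant of McNemar’s test was used, so the normal-approximation (Wald) interval and the exact binomial form below are assumptions. The function names and the toy labels are hypothetical and are not taken from the study.

```python
import math


def wald_ci(correct, total, z=1.96):
    """Accuracy with a normal-approximation (Wald) confidence interval.

    Assumption: the study's interval method is not stated in the abstract.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def discordant_counts(truth, pred_a, pred_b):
    """Count the cases where exactly one of two models is correct.

    b: model A correct, model B wrong; c: model A wrong, model B correct.
    """
    b = sum(1 for t, a, m in zip(truth, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(truth, pred_a, pred_b) if a != t and m == t)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree in correctness
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical toy labels (1 = nodule present, 0 = absent), not study data
    truth = [1, 1, 0, 0, 1, 0, 1, 0]
    model_a = [1, 0, 0, 1, 1, 0, 1, 0]
    model_b = [1, 1, 1, 1, 0, 0, 0, 0]

    correct_a = sum(t == a for t, a in zip(truth, model_a))
    print(wald_ci(correct_a, len(truth)))

    b, c = discordant_counts(truth, model_a, model_b)
    print(b, c, mcnemar_exact(b, c))
```

McNemar’s test uses only the discordant pairs, that is, the images on which exactly one of the two models is correct. This is why it suits a design in which every model reads the same set of radiographs.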
Epistemic and ethical limits of large language models in evidence-based medicine: from knowledge to judgment
Background: The rapid evolution of general large language models (LLMs) provides a promising framework for integrating artificial intelligence into medical practice. While these models are capable