Background: Although smoking-cessation aids such as support groups and nicotine replacement therapy (NRT) can help people quit, quit rates remain low. Mobile health interventions can boost accessibility and engagement, especially with NRT, but require ongoing effort to deliver timely responses. Accurate intent detection is crucial for identifying user needs and delivering timely, appropriate chatbot responses. Recent advances in large language models for natural language processing and artificial intelligence (AI) have shown promise. However, these systems often struggle with many intent categories, complex language, and imbalanced data, reducing recognition accuracy.

Objective: The main goal of this study was to develop an AI tool, based on a large language model, that could accurately detect the intent of people's messages despite dataset imbalance and complexity. In our application, the messages came from a smoking-cessation support-group intervention and often involved the use of NRT provided as part of that intervention.

Methods: Throughout, we used a state-of-the-art public-domain large language model, Llama-3 8B (8 billion parameters) from Meta. First, we used the model off the shelf. Second, we fine-tuned it on our annotated dataset with 25 intent categories. Third, we downsampled the predominant intent category to reduce model bias. Finally, we combined downsampling with corrected human annotations, creating a cleaned dataset for a new round of fine-tuning.

Results: Without fine-tuning, the model achieved unweighted and weighted F1-scores (overall performance) of 0.41 and 0.38, respectively, on the downsampled corrected test dataset, and 0.29 and 0.35 on the full test dataset. Fine-tuning improved performance to 0.77 and 0.80 on the downsampled corrected dataset, and 0.72 and 0.86 on the full dataset.
Fine-tuning with downsampling attained the best F1-scores, 0.88 and 0.91 on the downsampled corrected dataset, though performance dropped on the full test dataset (0.58 unweighted, 0.66 weighted) because of the predominance of the off-topic intent category; unweighted recall remained high (0.80). The final method, combining fine-tuning, downsampling, and error correction, achieved 0.86 unweighted and 0.90 weighted F1-scores on the downsampled corrected dataset, and 0.57 and 0.65 on the full dataset, with unweighted recall improving to 0.82.

Conclusions: Large language models performed poorly without fine-tuning, highlighting the need for domain-specific training. Even with fine-tuning, performance was limited by a highly imbalanced dataset. Downsampling before fine-tuning moderately improved performance but still left room for improvement and raised concerns about dataset noise. A careful review of model-human disagreement cases helped identify human annotation errors. Even after error correction, however, the method without error correction still achieved slightly higher precision and F1-score on the corrected test dataset. While error correction slightly improved recall on noisy data, automated downsampling alone may be sufficient, making manual correction a more resource-intensive option with limited added benefit.
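The two evaluation metrics used above, unweighted (macro) and weighted F1-scores, and the downsampling of the predominant intent category can be sketched as follows. This is an illustrative sketch only: the intent labels and the `downsample_majority` helper are assumptions for demonstration, not the study's actual data or code.

```python
from collections import Counter
import random


def per_class_f1(y_true, y_pred):
    """Per-class F1 computed from true and predicted intent labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores


def unweighted_f1(y_true, y_pred):
    """Macro average: every intent category counts equally,
    so rare categories influence the score as much as frequent ones."""
    s = per_class_f1(y_true, y_pred)
    return sum(s.values()) / len(s)


def weighted_f1(y_true, y_pred):
    """Average weighted by each category's support in y_true,
    so a predominant category (e.g. off-topic) dominates the score."""
    s = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(f * support[c] / len(y_true) for c, f in s.items())


def downsample_majority(texts, labels, majority_label, cap, seed=0):
    """Keep at most `cap` randomly chosen examples of the predominant
    category, leaving all other categories untouched."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    keep = set(rng.sample(majority_idx, min(cap, len(majority_idx))))
    pairs = [(t, l) for i, (t, l) in enumerate(zip(texts, labels))
             if l != majority_label or i in keep]
    return [t for t, _ in pairs], [l for _, l in pairs]
```

With a majority class that the model over-predicts, the weighted F1 can stay high while the macro F1 drops, which matches the gap between the two reported scores on the full test dataset.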
Unlocking electronic health records: a hybrid graph RAG approach to safe clinical AI for patient QA
Introduction
Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While



