Background: Although smoking-cessation aids such as support groups and nicotine replacement therapy (NRT) can help people quit, quit rates remain low. Mobile health interventions can boost accessibility and engagement, especially with NRT, but require ongoing effort to deliver timely responses. Accurate intent detection is crucial for identifying user needs and delivering timely, appropriate chatbot responses. Recent advances in large language models for natural language processing and artificial intelligence (AI) have shown promise. However, these systems often struggle with many intent categories, complex language, and imbalanced data, reducing recognition accuracy.

Objective: The main goal of this study was to develop an AI tool, based on a large language model, that could accurately detect the intent of people's messages despite dataset imbalance and complexity. In our application, the messages came from a smoking-cessation support-group intervention and often involved the use of NRT provided as part of that intervention.

Methods: Throughout, we used a state-of-the-art public-domain large language model, Llama-3 8B (8 billion parameters) from Meta. First, we used the model off the shelf. Second, we fine-tuned it on our annotated dataset with 25 intent categories. Third, we downsampled the predominant intent category to reduce model bias. Finally, we combined downsampling with corrected human annotations, creating a cleaned dataset for a new round of fine-tuning.

Results: Without fine-tuning, the model achieved unweighted and weighted F1-scores (overall performance) of 0.41 and 0.38, respectively, on the downsampled corrected test dataset, and 0.29 and 0.35 on the full test dataset. Fine-tuning improved performance to 0.77 and 0.80 on the downsampled corrected dataset, and 0.72 and 0.86 on the full dataset.
Fine-tuning with downsampling attained the best F1-scores, 0.88 and 0.91 on the downsampled corrected dataset, though performance dropped on the full test dataset (0.58 unweighted, 0.66 weighted) because of the predominance of the off-topic intent category; unweighted recall remained high (0.80). The final method, combining fine-tuning, downsampling, and error correction, achieved 0.86 unweighted and 0.90 weighted F1-scores on the downsampled corrected dataset, and 0.57 and 0.65 on the full dataset, with unweighted recall improving to 0.82.

Conclusions: Large language models performed poorly without fine-tuning, highlighting the need for domain-specific training. Even with fine-tuning, performance was limited by a highly imbalanced dataset. Downsampling before fine-tuning moderately improved performance but still left room for improvement and raised concerns about dataset noise. A careful review of model-human disagreement cases helped identify human annotation errors. Even after error correction, however, the method without error correction still achieved slightly higher precision and F1-score on the corrected test dataset. While error correction slightly improved recall on noisy data, automated downsampling alone may be sufficient, making manual correction a more resource-intensive option with limited added benefit.
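The two evaluation metrics used above, unweighted (macro) and weighted F1-scores, and the downsampling of the predominant intent category can be sketched as follows. This is an illustrative sketch only: the intent labels and the `downsample_majority` helper are assumptions for demonstration, not the study's actual data or code.

```python
from collections import Counter
import random


def per_class_f1(y_true, y_pred):
    """Per-class F1 computed from true and predicted intent labels."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores


def unweighted_f1(y_true, y_pred):
    """Macro average: every intent category counts equally,
    so rare categories influence the score as much as frequent ones."""
    s = per_class_f1(y_true, y_pred)
    return sum(s.values()) / len(s)


def weighted_f1(y_true, y_pred):
    """Average weighted by each category's support in y_true,
    so a predominant category (e.g. off-topic) dominates the score."""
    s = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(f * support[c] / len(y_true) for c, f in s.items())


def downsample_majority(texts, labels, majority_label, cap, seed=0):
    """Keep at most `cap` randomly chosen examples of the predominant
    category, leaving all other categories untouched."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    keep = set(rng.sample(majority_idx, min(cap, len(majority_idx))))
    pairs = [(t, l) for i, (t, l) in enumerate(zip(texts, labels))
             if l != majority_label or i in keep]
    return [t for t, _ in pairs], [l for _, l in pairs]
```

With a majority class that the model over-predicts, the weighted F1 can stay high while the macro F1 drops, which matches the gap between the two reported scores on the full test dataset.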
Unlocking electronic health records: a hybrid graph RAG approach to safe clinical AI for patient QA
Introduction
Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While



