Privacy-preserving augmentation of structured telehealth activity data in diabetes patients using natural language processing

Introduction
Diabetes management increasingly relies on telehealth platforms in which patients generate structured and unstructured data. The unstructured data, in the form of free-text notes, often contain information beyond what the structured data capture. Extracting this information can enhance patient profiles and optimize treatment. The extraction of physical activity information from these notes is considered particularly important. This study evaluates rule-based/regex algorithms and a locally deployed Mistral LLM for physical activity information extraction and data augmentation, with their performance benchmarked against a state-of-the-art model, GPT-4.1.

Methods
Data from 943 patients collected over 12 years in the DiabMemory system, supplemented by 100 synthetic notes, were analyzed. Patients' privacy was preserved by applying a free-text pseudonymization algorithm to all notes and by using locally deployed approaches, thereby avoiding third-party cloud services. Three tasks were conducted: (1) extraction of physical activity (PA) data from free-text notes using regex and a locally deployed Mistral LLM, (2) integration of extracted data with structured activity records using a rule-based approach and the local Mistral LLM, and (3) benchmarking of the local approaches against GPT-4.1 on the synthetic notes.

Results
Both local methods achieved strong performance in task 1, with F1-scores of at least 0.84. In task 2, rule-based augmentation (F1 = 0.73) surpassed the Mistral LLM (F1 = 0.37). Task 3 showed GPT-4.1 outperforming the local LLM but not consistently surpassing regex. The rule-based algorithms also required substantially less computation time than either LLM.

Discussion
The regex algorithm achieved superior accuracy and efficiency but required extensive dataset-specific development. Prompt engineering for the LLM, by contrast, required less knowledge and less development time. The findings of this work generally align with prior studies but are limited by the rather small test set and the use of synthetic data.

Conclusions
Local NLP approaches can enhance structured PA data in diabetes telehealth. Rule-based algorithms remain a strong option where computational resources are limited, though future work should validate these findings on larger and more diverse datasets.
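
To illustrate the kind of regex-based extraction evaluated in task 1 and how an F1-score is computed against annotated mentions, here is a minimal sketch. It is not the study's algorithm: the activity vocabulary, the pattern, the English example note, and the function names `extract_activities` and `f1_score` are all assumptions made for illustration, since the abstract does not describe the actual rules or the language of the notes.

```python
import re

# Hypothetical activity vocabulary (assumption; not taken from the study).
ACTIVITIES = ("walking", "cycling", "swimming", "running", "hiking")

# An activity word, optionally followed within 20 non-digit characters
# by a duration given in minutes.
PA_PATTERN = re.compile(
    r"\b(?P<activity>" + "|".join(ACTIVITIES) + r")\b"
    r"(?:\D{0,20}?(?P<minutes>\d{1,3})\s*(?:minutes|min)\b)?",
    re.IGNORECASE,
)


def extract_activities(note):
    """Return (activity, minutes) pairs; minutes is None if no duration was found."""
    results = []
    for match in PA_PATTERN.finditer(note):
        minutes = match.group("minutes")
        results.append(
            (match.group("activity").lower(), int(minutes) if minutes else None)
        )
    return results


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over exact-match items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


note = "Went cycling for 45 min after lunch. Short walking round in the evening."
predicted = extract_activities(note)
gold = [("cycling", 45), ("walking", 20)]  # hypothetical manual annotation
print(predicted)
print(f1_score(predicted, gold))
```

A pattern this simple can misattribute a duration when two activities appear close together, which hints at why the dataset-specific development noted in the Discussion is costly.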


Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844