Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models

Background
Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology, and labeling bias. Large language models (LLMs) are increasingly used in mental health for tasks such as symptom extraction, risk screening, and triage, yet their reliability for fine-grained depression subtype classification from brief social media posts remains underexplored.

Objective
We benchmarked few-shot, prompt-only LLMs against parameter-efficient fine-tuned encoders for identifying depression subtypes in posts on X (formerly Twitter).

Methods
We used a curated dataset of 14,983 English-language tweets stratified into six clinically grounded categories: five depression subtypes (postpartum, major, bipolar, psychotic, atypical) and a no-depression class. We compared (i) instruction-tuned causal LLMs in a few-shot setting and (ii) supervised fine-tuning of transformer encoders (e.g., RoBERTa, DeBERTa, BERTweet) under identical splits and metrics. The primary evaluation metric was macro-F1, with accuracy, precision, and recall as secondary metrics. We also report per-class precision, recall, and F1 scores, along with confusion matrices, for the best-performing model from each model family.

Results
Few-shot LLMs achieved macro-F1 = 0.73–0.77 (best: Llama-3-8B, accuracy 0.75). Fine-tuned encoders consistently outperformed prompt-only models, reaching macro-F1 = 0.94–0.96 (best: RoBERTa-large, accuracy 0.954). Relative improvements were largest for the clinically challenging classes: fine-tuning increased F1 for the postpartum and psychotic subtypes to ≈0.99 (substantially above few-shot) and boosted major-depression recall from ≈0.53–0.60 to ≈0.95–0.97. Error analyses showed that prompt-only models frequently misclassified major and atypical depression as bipolar, a pattern substantially reduced by fine-tuning.

Conclusions
On tweet-level depression subtyping, task-specific adaptation via fine-tuning yields substantially higher and more stable performance than few-shot prompting, particularly for nuanced, clinically anchored classes. These findings recommend fine-tuned encoders as strong, compute-efficient baselines for depression subtype classification from social media.
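Since macro-F1 is the primary metric, it is worth being precise about what it computes: the unweighted mean of per-class F1 over all six categories, so rare subtypes count as much as the majority class. A minimal sketch (the label names mirror the six categories above; the toy data in the usage note is illustrative, not from the dataset):

```python
# Macro-F1: average the per-class F1 scores with equal weight per class,
# regardless of how many examples each class has.
LABELS = ["postpartum", "major", "bipolar", "psychotic", "atypical", "none"]

def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 over `labels`."""
    f1s = []
    for c in labels:
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
    return sum(f1s) / len(labels)
```

Because each class contributes equally, a model that neglects a small class (e.g., psychotic depression) is penalized heavily, which is why macro-F1 suits this imbalanced, clinically weighted setup better than plain accuracy.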
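The few-shot condition conditions an instruction-tuned LLM on a handful of labeled exemplars rather than updating weights. A minimal sketch of how such a prompt might be assembled; the wording and exemplar texts are hypothetical, not the paper's actual prompt:

```python
# Illustrative few-shot prompt builder for tweet-level subtype classification.
SUBTYPES = ["postpartum", "major", "bipolar", "psychotic", "atypical", "none"]

def build_prompt(exemplars, post):
    """exemplars: list of (tweet_text, label) pairs shown in-context;
    post: the unlabeled tweet to classify."""
    lines = [
        "Classify the tweet into one of: " + ", ".join(SUBTYPES) + ".",
        "",
    ]
    for text, label in exemplars:
        lines.append(f"Tweet: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The model is expected to complete the final "Label:" line.
    lines.append(f"Tweet: {post}")
    lines.append("Label:")
    return "\n".join(lines)
```

Fine-tuning, by contrast, bakes the label space into the encoder's classification head, which is one plausible reason the paper finds it more stable on confusable classes such as major vs. bipolar depression.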


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK; registration number 16808844.