arXiv:2603.20225v1 Announce Type: cross
Abstract: Do expert personas improve language model performance? The Wharton Generative AI Lab reports that they do not, broadcasting to millions via social media the recommendation that practitioners abandon a technique endorsed by Anthropic, Google, and OpenAI. We demonstrate that this null finding was structurally predictable. Five core mechanisms precluded detection before data collection began: baseline contamination elevating the starting point to near-ceiling, system prompt hierarchy subordinating the experimental manipulation, impossible expert specifications collapsing to generic competence, format constraints suppressing reasoning processes, and provider exclusion limiting generalizability. Controlled trials correcting these limitations reveal what the original design obscured. To prevent baseline pattern matching and force reliance on genuine expert reasoning, we selected the hardest GPQA Diamond questions. On items with valid answer keys, expert personas achieve ceiling accuracy, eliminating all baseline errors through confidence amplification. Furthermore, forensic examination of model divergence reveals that half of the hardest GPQA items carry chemically or logically indefensible answer keys. The model's chain of thought shows it reasoning away from these impossible answers, so accurate chemistry is penalized. These findings recontextualize the original null results: methodologically sound persona research faces measurement constraints imposed by benchmark validity limitations. Answering the persona question requires evaluation infrastructure the field does not yet possess.
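The system-prompt-hierarchy confound is easiest to see in code. The sketch below, assuming the OpenAI Python SDK, contrasts a persona placed in the system message with one prepended to the user turn under a fixed harness system prompt; the persona text, question, and model name are illustrative placeholders, not the paper's materials.

```python
# Minimal sketch contrasting where an expert persona enters the
# message hierarchy; persona, question, and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA = "You are a professor of organic chemistry with 20 years of experience."
QUESTION = "Which product predominates under kinetic control? (A) ... (D) ..."  # placeholder item

# (a) Persona at the top of the instruction hierarchy: it governs the
# whole exchange.
persona_as_system = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice
    messages=[
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": QUESTION},
    ],
)

# (b) Persona buried in the user turn beneath a fixed harness system
# prompt: the harness instructions subordinate the manipulation.
persona_in_user = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer with a single letter, A-D. No explanation."},
        {"role": "user", "content": f"{PERSONA}\n\n{QUESTION}"},
    ],
)

print(persona_as_system.choices[0].message.content)
print(persona_in_user.choices[0].message.content)
```

The single-letter constraint in variant (b) also illustrates the format-constraint mechanism: it leaves no room for the reasoning process the persona is meant to elicit.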
Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models
Background: Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology,
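As a sketch of the few-shot prompting arm named in the title, the snippet below (again assuming the OpenAI Python SDK) classifies a single tweet against in-context examples; the subtype labels, example tweets, and model name are hypothetical placeholders rather than the paper's protocol.

```python
# Minimal sketch of few-shot (in-context) depression subtype
# classification; labels and examples are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_PROMPT = """Classify the tweet into one depression subtype.
Labels: melancholic, atypical, none.

Tweet: "Haven't enjoyed anything in weeks, up at 4 am again."
Label: melancholic

Tweet: "Slept 12 hours, still drained, but perked up when friends came by."
Label: atypical

Tweet: "{tweet}"
Label:"""

def classify_tweet(tweet: str) -> str:
    """Return the model's predicted subtype label for one tweet."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        temperature=0,        # near-deterministic labeling
        messages=[{"role": "user",
                   "content": FEW_SHOT_PROMPT.format(tweet=tweet)}],
    )
    return response.choices[0].message.content.strip()

print(classify_tweet("can't get out of bed, nothing feels worth it"))
```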