Depression subtype classification from social media posts: few-shot prompting vs. fine-tuning of large language models

Background: Social media provides timely proxy signals of mental health, but reliable tweet-level classification of depression subtypes remains challenging due to short, noisy text, overlapping symptomatology,

Educating Students About Digital Health Research Ethics: Curricula Review and Expert Interview Study

Background: The rapid growth of digital health research, involving wearable devices, mobile apps, and sociotechnical health systems, raises complex ethical, legal, and social considerations. While

Patient Sharing of Digital Health Data in the Veterans Health Administration: Cross-Sectional Analysis

Background: The integration of patient-generated health data (PGHD) into health care has the potential to significantly transform patient care and clinical practice. PGHD includes health-related

Enhancing Efficiency and Performance in Deepfake Audio Detection through Neuron-level dropin & Neuroplasticity Mechanisms

arXiv:2603.24343v1 Announce Type: cross Abstract: Current audio deepfake detection has achieved remarkable performance using diverse deep learning architectures such as ResNet, and has seen further

A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

arXiv:2603.24549v1 Announce Type: cross Abstract: Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven

Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

March 26, 2026

arXiv:2603.23682v1 Announce Type: cross
Abstract: The rapid adoption of large language models (LLMs) in education raises profound challenges for assessment design. To adapt assessments to the presence of LLM-based tools, it is crucial to characterize the strengths and weaknesses of LLMs in a generalizable, valid and reliable manner. However, current LLM evaluations often rely on descriptive statistics derived from benchmarks, and little research applies theory-grounded measurement methods to characterize LLM capabilities relative to human learners in ways that directly support assessment design. Here, by combining educational data mining and psychometric theory, we introduce a statistically principled approach for identifying items on which humans and LLMs show systematic response differences, pinpointing where assessments may be most vulnerable to AI misuse, and which task dimensions make problems particularly easy or difficult for generative AI. The method is based on Differential Item Functioning (DIF) analysis — traditionally used to detect bias across demographic groups — together with negative control analysis and item-total correlation discrimination analysis. It is evaluated on responses from human learners and six leading chatbots (ChatGPT-4o & 5.2, Gemini 1.5 & 3 Pro, Claude 3.5 & 4.5 Sonnet) to two instruments: a high school chemistry diagnostic test and a university entrance exam. Subject-matter experts then analyzed DIF-flagged items to characterize task dimensions associated with chatbot over- or under-performance. Results show that DIF-informed analytics provide a robust framework for understanding where LLM and human capabilities diverge, and highlight their value for improving the design of valid, reliable, and fair assessment in the AI era.
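
The abstract does not say which DIF statistic the authors use. As an illustration only, the sketch below assumes the classic Mantel-Haenszel procedure, with human learners as the reference group and chatbots as the focal group. Respondents are matched on their rest score (total score excluding the studied item), which is a simplifying choice made here. The function name `mantel_haenszel_dif` and the data layout are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict


def mantel_haenszel_dif(responses, item):
    """Mantel-Haenszel DIF statistic for one item (illustrative sketch).

    responses: list of (group, answers) pairs, where group is "human"
               (reference) or "llm" (focal) and answers is a list of 0/1
               item scores. This layout is an assumption for the example.
    item:      index of the studied item.

    Returns (alpha_mh, delta_mh), or None if the common odds ratio is
    undefined because the numerator or denominator is zero.
    """
    # One 2x2 table per matching stratum, stored as [A, B, C, D]:
    # A = human correct, B = human incorrect,
    # C = llm correct,   D = llm incorrect.
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, answers in responses:
        rest_score = sum(answers) - answers[item]  # matching variable
        cell = strata[rest_score]
        if group == "human":
            cell[0 if answers[item] else 1] += 1
        else:
            cell[2 if answers[item] else 3] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        return None

    alpha_mh = num / den                 # common odds ratio across strata
    delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta scale
    return alpha_mh, delta_mh
```

Under these assumptions, an odds ratio above 1 (negative delta) means that humans are more likely than chatbots with the same rest score to answer the item correctly, and an odds ratio below 1 (positive delta) means the item favors chatbots. Items with large absolute delta would be the candidates passed to subject-matter experts for review.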

