Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

April 27, 2026

arXiv:2604.22154v1 Announce Type: cross
Abstract: Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks such as assessing self-harm risk and screening for depression. However, common evaluation approaches, such as LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, which limits their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides a principled, adaptive alternative to heuristic voting. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled behavioral health datasets: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets: 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, which reduces incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
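
The idea of spending more samples on harder inputs can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify its bounds or its bandit strategy, so the sketch below swaps in a generic sequential vote with a textbook Hoeffding stopping rule. The names `hoeffding_radius`, `adaptive_vote`, and `sample_agent`, along with the defaults for `delta`, `min_samples`, and `max_samples`, are all hypothetical.

```python
import math
from typing import Callable, Tuple


def hoeffding_radius(n: int, delta: float) -> float:
    """Two-sided Hoeffding radius for the mean of n samples bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def adaptive_vote(
    sample_agent: Callable[[], int],  # hypothetical: returns 1 (flag) or 0 (safe)
    delta: float = 0.05,              # assumed confidence parameter
    min_samples: int = 3,             # assumed lower bound on queries
    max_samples: int = 50,            # assumed sampling budget
) -> Tuple[int, int]:
    """Query a binary agent until its vote share is clearly on one side of 0.5.

    Returns (decision, samples_used). Easy inputs stop early; ambiguous
    inputs use the whole budget. The radius is not corrected for repeated
    looks, so a rigorous version would need an anytime-valid bound.
    """
    positives = 0
    for n in range(1, max_samples + 1):
        positives += sample_agent()
        if n < min_samples:
            continue
        mean = positives / n
        if abs(mean - 0.5) > hoeffding_radius(n, delta):
            return int(mean > 0.5), n
    # Budget exhausted: fall back to a plain majority vote (ties map to 0,
    # which is an arbitrary choice made for this sketch).
    return int(positives / max_samples > 0.5), max_samples


if __name__ == "__main__":
    print(adaptive_vote(lambda: 1))  # an agent that always flags
```

With the defaults above, an agent that always returns the same label stops after 8 samples, while an agent that alternates between labels runs to the 50-sample cap. That gap is the sense in which the sampling effort adapts to input difficulty.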

