arXiv:2601.22984v2 Announce Type: replace Abstract: Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We […]
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
arXiv:2605.24171v1 Announce Type: cross Abstract: Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains uncharacterized. We present PromptAudit, a controlled evaluation framework that isolates prompt effects by fixing the dataset, decoding, and parsing while varying only the prompting strategy. Using five prompting strategies across five open-weight models […]
Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning
arXiv:2605.24058v1 Announce Type: cross Abstract: On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local […]
Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward
arXiv:2605.24992v1 Announce Type: cross Abstract: Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities because of its ability to learn through interaction. With the recent development of drone networks, researchers have also applied MARL to trajectory planning problems. However, the dynamic environment and the limited […]
Rethinking Federated Unlearning via the Lens of Memorization
arXiv:2605.24545v1 Announce Type: cross Abstract: Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning approaches may overlook the overlapping information between the unlearning and remaining data, leading to ineffective unlearning and unfairness between clients. In this work, we revisit federated unlearning through the lens of memorization. We argue […]
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
arXiv:2605.24786v1 Announce Type: cross Abstract: Long-horizon LLM inference turns the key–value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model’s current uncertainty. We introduce CONF-KV, a KV-cache manager […]
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
arXiv:2605.15777v2 Announce Type: replace Abstract: Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to […]
CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification
arXiv:2602.15811v2 Announce Type: replace-cross Abstract: Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study a task-incremental continual learning setting for chest radiograph classification under task-unknown inference, where heterogeneous chest X-ray datasets arrive sequentially and task […]
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure
arXiv:2605.23932v1 Announce Type: new Abstract: Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning an initially correct diagnosis under escalating pressure. We propose Med-Stress, a targeted stress test framework that evaluates belief stability under escalating pressure. Across nine frontier large language models (LLMs), we find a clear dissociation between medical […]
INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
arXiv:2510.01389v2 Announce Type: replace-cross Abstract: Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework for leveraging token-level uncertainty signals to predict when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we extract […]
ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
arXiv:2605.25388v1 Announce Type: cross Abstract: Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we […]
Persona-Model Collapse in Emergent Misalignment
arXiv:2605.12850v2 Announce Type: replace-cross Abstract: Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model’s internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using […]