arXiv:2605.28751v1 Announce Type: cross Abstract: Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit […]
A Policy-Driven Runtime Layer for Agentic LLM Serving
arXiv:2605.27744v1 Announce Type: new Abstract: Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising […]
Domain size asymptotics for Markov logic networks
arXiv:2509.04192v2 Announce Type: replace Abstract: A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^{\mathbb{M}}$ on the set $\mathbf{W}_n$ of structures, or “possible worlds”, with domain $\{1, \ldots, n\}$. We study the properties of such distributions as $n$ tends to infinity. We show that with mild assumptions on an MLN $\mathbb{M}$ with one soft […]
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
arXiv:2605.27752v1 Announce Type: new Abstract: LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals are sometimes treated as direct readouts of model uncertainty, but their comparison depends on measurement choices that are rarely made explicit. In the main analysis, we hold the verbalized-confidence elicitation fixed: a single […]
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
arXiv:2603.00309v2 Announce Type: replace Abstract: The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents to collaboratively complete complex tasks. While many agentic AI systems reduce complexity through predefined workflows or fixed agent roles, the ideal is to support truly autonomous agents capable of emergent collaboration […]
Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection
arXiv:2605.28524v1 Announce Type: new Abstract: In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most existing methods rely heavily on rich text attributes, which poses difficulties for this domain due to the lack of textual data. Although some pioneering methods attempt to overcome it, […]
Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
arXiv:2603.09882v2 Announce Type: replace-cross Abstract: Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short […]
Entropy-aware Masking for Masked Language Modeling
arXiv:2605.28526v1 Announce Type: new Abstract: Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, […]
ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations
arXiv:2605.27908v1 Announce Type: cross Abstract: Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), […]
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
arXiv:2605.27901v1 Announce Type: cross Abstract: Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, […]
Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval
arXiv:2605.11325v2 Announce Type: replace-cross Abstract: Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system retrieved correctly. A system returning its entire belief store achieves recall of 1.0 and passes answer-quality evaluation. This is the difference between a unit test and an integration test: retrieval […]
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
arXiv:2605.27567v1 Announce Type: new Abstract: Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure […]