Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

arXiv:2605.19185v1 Announce Type: cross Abstract: Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result […]

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

arXiv:2605.19099v1 Announce Type: new Abstract: We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering […]

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv:2605.19228v1 Announce Type: cross Abstract: Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, […]

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

arXiv:2605.08146v2 Announce Type: replace-cross Abstract: Multi-model learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce textitVT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across […]

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

arXiv:2605.19285v1 Announce Type: cross Abstract: The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models […]

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

arXiv:2605.19127v1 Announce Type: new Abstract: LLM agents increasingly have access to private user data and act on the user’s behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), […]

Unlocking the Potential of Continual Model Merging: An ODE Perspective

arXiv:2605.19409v1 Announce Type: cross Abstract: Continual Model Merging (CMM) enables rapid customization of foundation models across sequentially arriving tasks, offering a scalable alternative to repeated retraining. However, existing merging rules lack explicit controllability over the allocation of learning capacity between previously learned capabilities and newly merged models. Consequently, as tasks are merged sequentially, this deficiency […]

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

arXiv:2605.18824v1 Announce Type: cross Abstract: Evaluation of foundation models often rely on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness […]

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

arXiv:2605.19561v1 Announce Type: cross Abstract: As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In […]

The fitness landscape of social norms in social dilemmas

arXiv:2605.18834v1 Announce Type: cross Abstract: By specifying behaviour across multiple agents, social norms are a coordination approach to resolving social dilemmas. Decentralized and wide adoption can be achieved by norms whose prescription involves interpreting stochastic signals in the environment. Such signals must have enough correlation to orchestrate mutually beneficial coordination and enough disincentivizing uncertainty about […]

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

arXiv:2605.19151v1 Announce Type: new Abstract: We formalize trust calibration for agentic tool use (deciding when an automated agent’s proposed action may execute autonomously versus require human approval) as a preference-learning problem. A policy gateway maintains a Gaussian-process posterior over a latent human risk-tolerance function, observed through a probit likelihood on binary approve/deny feedback, and escalates […]

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844