How Much Heavy Lifting Can an Agent Harness Do?: Measuring the LLM’s Residual Role in a Planning Agent

April 28, 2026

arXiv:2604.07236v3 Announce Type: replace
Abstract: Agent harnesses, the stateful programs that wrap a language model and decide what it sees at each step, are now known to change end-to-end performance on a fixed model by as much as six times. That observation raises a question asked less often than it should be: once the harness is serious, how much of an agent’s competence does the harness itself already carry, and how much genuinely still needs the LLM? We study this in noisy Collaborative Battleship, a partially observable planning setting with belief update, information-gathering questions, and uncertainty-aware action selection. We externalize a planning harness into four progressively richer layers (posterior belief tracking, declarative planning, symbolic reflection, and an LLM-backed revision gate) and report per-layer contribution under a common runtime. We report win rate as the primary, game-level metric and F1 as a secondary, local-targeting indicator, and pre-specify "heavy lifting" as the single largest positive marginal contribution to the primary metric. Across 54 games, the declarative planning layer does most of the heavy lifting under this criterion, raising win rate from 50.0% (Wilson 95% CI $[37.1, 62.9]$) to 74.1% ($[61.1, 83.9]$) over a belief-only harness (+24.1pp, +0.017 F1). Symbolic reflection is mechanistically real but calibration-sensitive, shifting board-level outcomes by up to $\pm 0.140$ F1 without being net-positive on aggregate. LLM-backed revision activates on only 4.3% of turns at the strictest confidence threshold and yields a small, non-monotonic change (+0.005 F1, -3.7pp win rate). The contribution is methodological: once harness layers are made externally measurable, one can ask not only how far the harness already carries the agent, but also where the LLM’s role is actually residual rather than central.
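The Wilson 95% intervals quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only and is not taken from the paper: the function name `wilson_interval` is hypothetical, and the win counts (27 of 54 for the belief-only harness, 40 of 54 with declarative planning) are assumptions inferred from the reported percentages, since the abstract gives rates rather than raw counts. It also assumes each condition was evaluated on all 54 games.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval (lower, upper) for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# ASSUMPTION: win counts are inferred from the reported rates (50.0% and 74.1% of 54 games);
# the abstract does not state them explicitly.
belief_only = wilson_interval(27, 54)
with_planning = wilson_interval(40, 54)
gain_pp = (40 / 54 - 27 / 54) * 100  # difference in win rate, in percentage points

print(f"belief-only:   [{belief_only[0] * 100:.1f}, {belief_only[1] * 100:.1f}]")
print(f"with planning: [{with_planning[0] * 100:.1f}, {with_planning[1] * 100:.1f}]")
print(f"gain: {gain_pp:+.1f}pp")
```

Under these assumed counts, the computed intervals agree with the reported ones to one decimal place. The two intervals nearly touch (upper bound 62.9 against lower bound 61.1), which gives a sense of how much resolution a 54-game sample offers for the smaller per-layer effects.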
