arXiv:2603.29231v1 Announce Type: new
Abstract: Existing benchmarks measure capability (whether a model succeeds on a single attempt), but production deployments require reliability: consistent success across repeated attempts on tasks of varying duration. We show that these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence.
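
To make the capability/reliability distinction concrete, here is a minimal hypothetical sketch (not the paper's code) contrasting pass@1 with an all-of-k reliability estimate; the gap between the two widens exactly where per-attempt success is lowest, i.e., on longer tasks:

    # Hypothetical illustration, not from the paper: pass@1 scores a single
    # attempt, while a reliability-style metric asks that every one of k
    # repeated attempts succeed.

    def pass_at_1(attempts: list[int]) -> float:
        """Capability: fraction of single attempts that succeed."""
        return sum(attempts) / len(attempts)

    def all_of_k(attempts: list[int], k: int) -> float:
        """Reliability: estimated chance that k independent attempts
        on the task all succeed, p**k for per-attempt rate p."""
        return pass_at_1(attempts) ** k

    short_task = [1] * 9 + [0]       # 0.90 per-attempt success
    long_task = [1] * 6 + [0] * 4    # 0.60 per-attempt success
    print(pass_at_1(short_task), round(all_of_k(short_task, 5), 2))  # 0.9 0.59
    print(pass_at_1(long_task), round(all_of_k(long_task, 5), 2))    # 0.6 0.08

A modest 0.30 drop in single-attempt success turns into a 0.51 drop under the all-of-k view, which is the structural blindness of short-task pass@1 in miniature.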
We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve
(RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We
evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains.
Key findings: (1) reliability decay is domain-stratified: software-engineering (SE) GDS drops from 0.90 to 0.44, while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier, making high VAF a capability signature rather than an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.
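
The abstract names the four metrics without defining them; purely as an assumed sketch of the general shape (bucketed success rates plus an onset threshold, not the paper's actual formulas), an RDC and MOP might be computed like this:

    # Assumed sketch only: RDC/VAF/GDS/MOP are not defined in the abstract,
    # so the formulas below are guesses at the general shape, not the
    # paper's definitions.
    from collections import defaultdict
    from statistics import mean

    def reliability_decay_curve(episodes):
        """episodes: (duration_bucket, succeeded) pairs across repeats.
        Assumed RDC: mean success per duration bucket, shortest first."""
        by_bucket = defaultdict(list)
        for bucket, ok in episodes:
            by_bucket[bucket].append(ok)
        return {b: mean(v) for b, v in sorted(by_bucket.items())}

    def meltdown_onset_point(rdc, frac=0.5):
        """Assumed MOP: first bucket whose success rate falls below
        `frac` of the shortest-bucket baseline; None if it never does."""
        buckets = sorted(rdc)
        baseline = rdc[buckets[0]]
        for b in buckets:
            if rdc[b] < frac * baseline:
                return b
        return None

    episodes = [(0, 1), (0, 1), (0, 1), (0, 0),   # short bucket:  0.75
                (1, 1), (1, 1), (1, 0), (1, 0),   # medium bucket: 0.50
                (2, 1), (2, 0), (2, 0), (2, 0)]   # long bucket:   0.25
    rdc = reliability_decay_curve(episodes)
    print(rdc)                        # {0: 0.75, 1: 0.5, 2: 0.25}
    print(meltdown_onset_point(rdc))  # 2 (0.25 < 0.5 * 0.75)
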
When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t
arXiv:2604.06422v1 Announce Type: cross
Abstract: Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere