arXiv:2601.04695v2 Announce Type: replace
Abstract: Out-of-distribution generalization in reinforcement learning is hard to diagnose when benchmark shifts mix dynamics, observations, goals, and rewards. We address this with Tape, a controlled benchmark that isolates latent rule-shift in dynamics while keeping the observation-action interface fixed. The protocol combines deterministic splits, 20-seed replication, bootstrap uncertainty reporting, and continuous metrics for sparse-success regimes. Across baseline families, we find a consistent ID-to-OOD drop and strong heterogeneity across stable, periodic, and chaotic rules. Importantly, this fragility appears even in an intentionally simple 1D deterministic setting, suggesting that many current RL algorithms remain brittle to latent-law changes under minimal confounds. To calibrate strict success, we report a protocol-matched true-dynamics random-shooting reference (p_oracle ≈ 0.187) and oracle-normalized scores ON(p) = 100 p / p_oracle; this is a budgeted operational reference, not a global-optimality bound. A smaller feasibility regime (L = H = 16) with 100% rule-wise solvability helps separate reachability limits from policy failure. These results position Tape as a mechanism-oriented diagnostic for robust adaptation and latent-mechanism inference, and as a controlled benchmark relevant to broader AGI-oriented evaluation without making strong AGI sufficiency claims.
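As an illustration of the oracle-normalized score ON(p) = 100 p / p_oracle and of bootstrap uncertainty reporting over seeds, the following is a minimal sketch rather than the benchmark's actual code. The function names, the percentile bootstrap, the resample count, and the per-seed success rates are all assumptions made for this example; only the formula and the approximate value p_oracle ≈ 0.187 come from the abstract.

```python
import random

P_ORACLE = 0.187  # approximate reference value quoted in the abstract


def oracle_normalized(p, p_oracle=P_ORACLE):
    """Oracle-normalized score: ON(p) = 100 * p / p_oracle."""
    return 100.0 * p / p_oracle


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-seed values.

    The percentile method and the resample count are assumptions for this
    sketch; the abstract does not specify the bootstrap variant used.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample the seeds with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical strict-success rates for 20 seeds (illustrative, not results).
    per_seed_p = [0.02, 0.05, 0.00, 0.03, 0.04, 0.01, 0.06, 0.02, 0.03, 0.05,
                  0.00, 0.04, 0.02, 0.03, 0.01, 0.05, 0.02, 0.04, 0.03, 0.01]
    per_seed_on = [oracle_normalized(p) for p in per_seed_p]
    mean_on = sum(per_seed_on) / len(per_seed_on)
    lo, hi = bootstrap_ci(per_seed_on)
    print(f"ON mean = {mean_on:.1f}, 95% bootstrap CI = [{lo:.1f}, {hi:.1f}]")
```

By construction, a policy that matches the random-shooting reference scores 100 on this scale. Because the reference is budgeted rather than a global-optimality bound, scores above 100 are possible in principle.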


