arXiv:2604.07355v1 Announce Type: cross Abstract: We introduce Prediction Arena, a benchmark for evaluating AI models’ predictive accuracy and decision-making by enabling them to trade autonomously on live prediction markets with real capital. Unlike synthetic benchmarks, Prediction Arena tests models in environments where trades execute on actual exchanges (Kalshi and Polymarket), providing objective ground truth that […]
Hidden Biases in Conditioning Autoregressive Models
arXiv:2604.07855v1 Announce Type: new Abstract: Large language and music models are increasingly used for constrained generation: rhyming lines, fixed meter, inpainting or infilling, positional endings, and other global form requirements. These systems often perform strikingly well, but the induced procedures are usually not exact conditioning of the underlying autoregressive model. This creates a hidden inferential […]
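The bias the abstract describes can be illustrated with a toy two-step autoregressive model. This is a hypothetical example, not taken from the paper: it shows that forcing a positional ending by masking the final step is not the same as exactly conditioning the model on that ending, since the earlier tokens are still drawn from the unconditioned prior.

```python
# Toy two-step autoregressive model over tokens {"a", "b"} (hypothetical
# numbers, chosen only to make the bias visible).
p_x1 = {"a": 0.9, "b": 0.1}
p_x2_given_x1 = {
    "a": {"a": 0.1, "b": 0.9},  # after "a", ending in "a" is unlikely
    "b": {"a": 0.9, "b": 0.1},  # after "b", ending in "a" is likely
}

# Global constraint: the sequence must end in "a" (x2 == "a").

# Heuristic procedure: sample x1 from the unconditioned model, then mask
# the final step so that x2 = "a". The induced marginal over x1 is just
# the unconditioned prior.
heuristic_x1 = dict(p_x1)

# Exact conditioning: p(x1 | x2 = "a") is proportional to
# p(x1) * p(x2 = "a" | x1).
joint = {t: p_x1[t] * p_x2_given_x1[t]["a"] for t in p_x1}
z = sum(joint.values())
exact_x1 = {t: joint[t] / z for t in joint}

print(heuristic_x1)  # {'a': 0.9, 'b': 0.1}
print(exact_x1)      # {'a': 0.5, 'b': 0.5}
```

Here the heuristic keeps 90% of its mass on "a" as the first token, while the exact conditional splits the mass evenly, because "b" makes the required ending far more likely.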
DialBGM: A Benchmark for Background Music Recommendation from Everyday Multi-Turn Dialogues
arXiv:2604.07895v1 Announce Type: new Abstract: Selecting appropriate background music (BGM) that supports natural human conversation is a common production step in media and interactive systems. In this paper, we introduce dialogue-conditioned BGM recommendation, where a model should select non-intrusive, fitting music for a multi-turn conversation that often contains no music descriptors. To study this […]
Capture-Quiet Decomposition: A Verification Theorem for Chess Endgame Tablebases
arXiv:2604.07907v1 Announce Type: new Abstract: We present the Capture-Quiet Decomposition (CQD), a structural theorem for verifying Win-Draw-Loss (WDL) labelings of chess endgame tablebases. The theorem decomposes every legal position into exactly one of three categories — terminal, capture, or quiet — and shows that a WDL labeling is correct if and only if: (1) terminal […]
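A minimal sketch of the three-way split named in the abstract, under one plausible reading: a position is terminal if the game is over, a capture position if any legal capture is available, and quiet otherwise. The `Position` representation and move encoding below are stand-ins invented for illustration, not the paper's formalism.

```python
from dataclasses import dataclass, field

@dataclass
class Position:
    """Hypothetical stand-in for a chess position."""
    legal_moves: list = field(default_factory=list)  # (is_capture, name) pairs
    is_checkmate: bool = False
    is_stalemate: bool = False

def cqd_category(pos: Position) -> str:
    """Assign a position to exactly one CQD category.

    The three cases are checked in order, so every position gets
    exactly one label: terminal, capture, or quiet.
    """
    if pos.is_checkmate or pos.is_stalemate or not pos.legal_moves:
        return "terminal"
    if any(is_capture for is_capture, _ in pos.legal_moves):
        return "capture"
    return "quiet"

# Example usage with made-up positions:
mate = Position(is_checkmate=True)
with_capture = Position(legal_moves=[(True, "Qxe5"), (False, "Kd2")])
no_capture = Position(legal_moves=[(False, "Nf3")])
print(cqd_category(mate), cqd_category(with_capture), cqd_category(no_capture))
```

The point of the decomposition, as the abstract presents it, is that the classifier is exhaustive and exclusive, so a WDL labeling can be checked category by category.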
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
arXiv:2604.07927v1 Announce Type: new Abstract: Deep research requires reasoning over web evidence to answer open-ended questions, and it is a core capability for AI agents. Yet many deep research agents still rely on implicit, unstructured search behavior that causes redundant exploration and brittle evidence aggregation. Motivated by Anthropic’s “think” tool paradigm and insights from the […]
WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
arXiv:2604.07957v1 Announce Type: new Abstract: Vision-language models (VLMs) and generative world models are opening new opportunities for embodied navigation. VLMs are increasingly used as direct planners or trajectory predictors, while world models support look-ahead reasoning by imagining future views. Yet predicting a reliable trajectory from a single egocentric observation remains challenging. Current VLMs often generate […]
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
arXiv:2604.07973v1 Announce Type: new Abstract: Large multimodal models (LMMs) show strong visual-linguistic reasoning, but their capacity for spatial decision-making and action remains unclear. In this work, we investigate whether LMMs can achieve embodied spatial action like humans through a challenging scenario: goal-oriented navigation in urban 3D spaces. We first spend over 500 hours constructing a […]
CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection
arXiv:2604.07487v1 Announce Type: new Abstract: Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches rely primarily on context generated from past experience and on retrieval mechanisms that reuse this context. However, retrieved context from past tasks must be adapted by the execution agent […]
Evaluating Counterfactual Explanation Methods on Incomplete Inputs
arXiv:2604.08004v1 Announce Type: new Abstract: Existing algorithms for generating Counterfactual Explanations (CXs) for Machine Learning (ML) typically assume fully specified inputs. However, real-world data often contains missing values, and the impact of these incomplete inputs on the performance of existing CX methods remains unexplored. To address this gap, we systematically evaluate recent CX generation methods […]
“Why This Avoidance Maneuver?” Contrastive Explanations in Human-Supervised Maritime Autonomous Navigation
arXiv:2604.08032v1 Announce Type: new Abstract: Automated maritime collision avoidance will rely on human supervision for the foreseeable future. This necessitates transparency into how the system perceives a scenario and plans a maneuver. However, the causal logic behind avoidance maneuvers is often complex and difficult to convey to a navigator. This paper explores how to explain […]
ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
arXiv:2604.07484v1 Announce Type: new Abstract: Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from […]
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
arXiv:2604.08064v1 Announce Type: new Abstract: Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit […]