arXiv:2603.03482v2 Announce Type: replace-cross Abstract: Interactive world models continually generate video by responding to a user’s actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an […]
Competition, stability, and functionality in excitatory-inhibitory neural circuits
arXiv:2512.05252v2 Announce Type: replace Abstract: Energy-based models have become a central paradigm for understanding computation and stability in both theoretical neuroscience and machine learning. However, the energetic framework typically relies on symmetry in synaptic or weight matrices – a constraint that excludes biologically realistic systems such as excitatory-inhibitory (E-I) networks. When symmetry is relaxed, the […]
CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems
arXiv:2407.13922v3 Announce Type: replace-cross Abstract: Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and […]
AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE
arXiv:2606.03631v2 Announce Type: replace-cross Abstract: Multivariate time series classification (MTSC) is pivotal in high-stakes domains, such as clinical diagnosis and industrial fault detection, where safe deployment necessitates transparent decision-making. However, isolating the temporal segments that drive model predictions is challenging because discriminative signals in real-world time series are typically sparse, heterogeneous, and heavily obscured by […]
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
arXiv:2606.04970v1 Announce Type: cross Abstract: We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from […]
Consistency Training Can Entrench Misalignment
arXiv:2606.03810v2 Announce Type: replace-cross Abstract: Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods […]
DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving
arXiv:2606.04987v1 Announce Type: cross Abstract: Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages […]
From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents
arXiv:2606.04990v1 Announce Type: cross Abstract: Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence […]
q0: Primitives for Hyper-Epoch Pretraining
arXiv:2606.03938v2 Announce Type: replace-cross Abstract: Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring […]
Oxygenation and spatial heterogeneity shape radiotherapy protocol ranking through phenotypic adaptation
arXiv:2606.04004v1 Announce Type: new Abstract: Tumor response to radiotherapy is strongly influenced by oxygen availability and phenotypic heterogeneity, yet their combined impact on the relative performance of fractionation schedules remains unclear. Here, we develop a mathematical model that integrates spatial oxygen dynamics with continuous phenotypic adaptation to hypoxia and radiation, and use it to systematically […]
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
arXiv:2606.03606v2 Announce Type: replace-cross Abstract: Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external […]
PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models
arXiv:2606.03598v2 Announce Type: replace-cross Abstract: Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns […]