arXiv:2604.10985v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. However, the integration of new […]
Sanity Checks for Agentic Data Science
arXiv:2604.11003v1 Announce Type: new Abstract: Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we […]
OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling
arXiv:2604.09580v1 Announce Type: new Abstract: Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. […]
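The abstract contrasts linear natural-language reasoning with explicit state-space and object-hierarchy representations. A minimal illustration of what an object-oriented world state with programmatic transitions might look like (the class and function names here are hypothetical, invented for illustration, and are not the paper's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class WorldObject:
    """Explicit object state, in contrast to free-form chain-of-thought text."""
    name: str
    location: str
    contains: list = field(default_factory=list)

def move(world, obj_name, dest):
    """A programmatic state transition: the effect on the world is explicit
    and checkable, rather than implied by a natural-language sentence."""
    world[obj_name].location = dest
    return world

# A tiny world: two objects with explicit locations.
world = {
    "apple": WorldObject("apple", "table"),
    "bowl": WorldObject("bowl", "counter"),
}
move(world, "apple", "counter")
```

The point of such a representation is that causal dependencies (e.g. "the apple is now wherever we moved it") are enforced by code rather than left to the consistency of generated text.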
Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling
arXiv:2604.11040v1 Announce Type: new Abstract: Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower […]
RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
arXiv:2604.11626v1 Announce Type: new Abstract: Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at […]
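The abstract describes reward models that emit explicit, multi-dimensional critiques before producing a score. One hypothetical way to structure such an output (this schema is an assumption for illustration, not the paper's format):

```python
from dataclasses import dataclass

@dataclass
class DimensionCritique:
    """One dimension of judgment with its free-text reasoning attached."""
    dimension: str   # e.g. "composition", "prompt fidelity"
    critique: str    # the explicit reasoning behind the score
    score: float     # normalized to [0, 1]

def aggregate(critiques):
    """Collapse per-dimension scores into a single scalar reward,
    while the critiques themselves remain inspectable."""
    return sum(c.score for c in critiques) / len(critiques)
```

Unlike a single unexplained score, each dimension's rationale survives aggregation and can be used as an optimization signal in its own right.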
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
arXiv:2604.10708v1 Announce Type: cross Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and […]
Beyond Message Passing: A Semantic View of Agent Communication Protocols
arXiv:2604.02369v3 Announce Type: replace-cross Abstract: Agent communication protocols are becoming critical infrastructure for large language model (LLM) systems that must use tools, coordinate with other agents, and operate across heterogeneous environments. This work presents a human-inspired perspective on this emerging landscape by organizing agent communication into three layers: communication, syntactic, and semantic. Under this framework, […]
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
arXiv:2604.10688v1 Announce Type: cross Abstract: On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, […]
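The abstract's contrast is between sparse outcome-level rewards and the dense, token-level KL supervision used by On-Policy Distillation. A minimal sketch of a per-token forward KL objective against a teacher (function names are hypothetical; real implementations operate on logit tensors in an autodiff framework):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def token_kl(teacher_logits, student_logits):
    """Forward KL(teacher || student) at a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sequence_kl_loss(teacher_seq, student_seq):
    """Dense supervision: average per-token KL over a rollout, so every
    token position receives a training signal, not just the final outcome."""
    kls = [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)
```

The abstract's critique is that this supervision is typically applied uniformly across rollouts; SCOPE's dual-path adaptive weighting presumably reweights it, but the weighting scheme is not shown in the truncated text.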
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
arXiv:2603.27494v2 Announce Type: replace-cross Abstract: To enhance the perception and reasoning capabilities of multimodal large language models in complex visual scenes, recent research has introduced agent-based workflows. In these works, MLLMs autonomously utilize an image cropping tool to analyze regions of interest for question answering. While existing training strategies, such as those employing supervised fine-tuning and […]
One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions
arXiv:2604.11403v1 Announce Type: cross Abstract: Analyzing unsteady fluid flows often requires access to the full distribution of possible temporal states, yet conventional PDE solvers are computationally prohibitive and learned time-stepping surrogates quickly accumulate error over long rollouts. Generative models avoid compounding error by sampling states independently, but diffusion and flow-matching methods, while accurate, are limited […]
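The abstract's motivating claim, that learned time-stepping surrogates accumulate error over long rollouts while generative models that sample states independently do not compound error, can be illustrated with a toy dynamical system (the dynamics and the surrogate's bias below are invented purely for illustration):

```python
def true_step(x):
    """Ground-truth dynamics: simple decay toward zero."""
    return 0.9 * x

def learned_step(x, bias=0.01):
    """A surrogate with a small systematic per-step error."""
    return 0.9 * x + bias

def rollout_error(x0, n_steps):
    """Roll both systems forward and measure how far the surrogate drifts."""
    xt, xs = x0, x0
    for _ in range(n_steps):
        xt, xs = true_step(xt), learned_step(xs)
    return abs(xs - xt)
```

Here the drift grows monotonically with rollout length, whereas a model that samples each temporal state independently pays the per-sample error once rather than compounding it.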
What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
arXiv:2601.06165v2 Announce Type: replace-cross Abstract: Current vision-language benchmarks predominantly feature well-structured questions with clear, explicit prompts. However, real user queries are often informal and underspecified. Users naturally leave much unsaid, relying on images to convey context. We introduce HAERAE-Vision, a benchmark of 653 real-world visual questions from Korean online communities (0.76% survival from 86K candidates), […]
Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank’s Event Semantics
arXiv:2603.25975v2 Announce Type: replace-cross Abstract: We show that they do. Roger Schank’s conceptual dependency theory proposed that all human events decompose into primitive operations — ATRANS (transfer of possession), PTRANS (physical movement), MTRANS (information transfer), and others — hand-coded from linguistic intuition. We ask: can the same primitives be discovered automatically through compression pressure alone? […]