arXiv:2603.17234v1 Announce Type: cross Abstract: Surgical co-management (SCM) is an evidence-based model in which hospitalists jointly manage medically complex perioperative patients alongside surgical teams. Despite its clinical and financial value, SCM is limited by the need to manually identify eligible patients. To determine whether SCM triage can be automated, we conducted a prospective, unblinded study […]
Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding
arXiv:2603.17307v1 Announce Type: cross Abstract: Despite rapid developments and widespread applications of MLLM agents, they still struggle with long-form video understanding (LVU) tasks, which are characterized by high information density and extended temporal spans. Recent research on LVU agents demonstrates that simple task decomposition and collaboration mechanisms are insufficient for long-chain reasoning tasks. Moreover, directly […]
AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement
arXiv:2603.17441v1 Announce Type: cross Abstract: GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, […]
KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition
arXiv:2603.17524v1 Announce Type: cross Abstract: In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task, in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) from initiation through completion, at key moments, unlike existing action instructions that capture kinematics only coarsely or partially, thereby supporting fine-grained and […]
Joint Optimization of Storage and Loading for High-Performance 3D Point Cloud Data Processing
arXiv:2603.16945v1 Announce Type: cross Abstract: With the rapid development of computer vision and deep learning, significant advancements have been made in 3D vision, particularly in autonomous driving, robotic perception, and augmented reality. 3D point cloud data, as a crucial representation of 3D information, has gained widespread attention. However, the vast scale and complexity of […]
PhysQuantAgent: An Inference Pipeline of Mass Estimation for Vision-Language Models
arXiv:2603.16958v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) are increasingly applied to robotic perception and manipulation, yet their ability to infer physical properties required for manipulation remains limited. In particular, estimating the mass of real-world objects is essential for determining appropriate grasp force and ensuring safe interaction. However, current VLMs lack reliable mass reasoning capabilities, […]
MSRAMIE: Multimodal Structured Reasoning Agent for Multi-instruction Image Editing
arXiv:2603.16967v1 Announce Type: cross Abstract: Existing instruction-based image editing models perform well with simple, single-step instructions but degrade in realistic scenarios that involve multiple, lengthy, and interdependent directives. A main cause is the scarcity of training data with complex multi-instruction annotations. However, it is costly to collect such data and retrain these models. To address […]
The State of Generative AI in Software Development: Insights from Literature and a Developer Survey
arXiv:2603.16975v1 Announce Type: cross Abstract: Generative Artificial Intelligence (GenAI) rapidly transforms software engineering, yet existing research remains fragmented across individual tasks in the Software Development Lifecycle. This study integrates a systematic literature review with a survey of 65 software developers. The results show that GenAI exerts its highest impact in design, implementation, testing, and documentation, […]
Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization
arXiv:2603.17052v1 Announce Type: cross Abstract: Vector quantization is a technique in machine learning that discretizes continuous representations into a set of discrete vectors. It is widely employed in tokenizing data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain […]
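As a rough illustration of the vector quantization step this abstract describes (mapping continuous vectors onto a discrete set), here is a minimal nearest-neighbour sketch. The function name `quantize`, the toy codebook, and the squared-Euclidean distance are illustrative assumptions, not details taken from the paper.

```python
def quantize(vectors, codebook):
    """Map each continuous vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance is the similarity measure.
    """
    indices = []
    for v in vectors:
        # Squared distance from v to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        # The token is the position of the closest codebook vector.
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical 2-D codebook with three entries and three input vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
vectors = [(0.1, -0.2), (0.9, 1.2), (0.2, 0.8)]

tokens = quantize(vectors, codebook)
print(tokens)
```

Each input is replaced by a discrete index, which is what lets a downstream generative model treat continuous data as a token sequence.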
REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge
arXiv:2603.17145v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm known as LLM-as-a-Judge. However, standard Reinforcement Learning (RL) methods typically rely on binary rewards (e.g., 0-1 accuracy), thereby ignoring the ordinal structure inherent in regression tasks; for instance, they fail to […]
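To make the point about binary rewards concrete, the sketch below contrasts a 0-1 accuracy reward with a distance-based one. The `distance_reward` function and the 10-point scale are hypothetical choices for illustration only; this is not the reward proposed in the paper.

```python
def binary_reward(predicted, target):
    """0-1 accuracy: full reward only for an exact match."""
    return 1.0 if predicted == target else 0.0


def distance_reward(predicted, target, max_score=10):
    """Hypothetical ordinal-aware reward: decreases with absolute error.

    Assumption: scores lie on a scale whose width is max_score.
    """
    return 1.0 - abs(predicted - target) / max_score


target = 8
near_miss, far_miss = 7, 1

# The binary reward cannot tell a near miss from a far miss.
print(binary_reward(near_miss, target), binary_reward(far_miss, target))

# The distance-based reward ranks the near miss above the far miss.
print(distance_reward(near_miss, target), distance_reward(far_miss, target))
```

Under the binary reward both wrong scores receive the same signal, so the ordering of the score scale is invisible to the learner.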
Generalist Multimodal LLMs Gain Biometric Expertise via Human Salience
arXiv:2603.17173v1 Announce Type: cross Abstract: Iris presentation attack detection (PAD) is critical for secure biometric deployments, yet developing specialized models faces significant practical barriers: collecting data representing future unknown attacks is impossible, and collecting diverse-enough data, yet still limited in terms of its predictive power, is expensive. Additionally, sharing biometric data raises privacy concerns. Due […]
Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
arXiv:2603.17199v1 Announce Type: cross Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response […]