arXiv:2604.19839v1 Announce Type: cross Abstract: Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named […]
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
arXiv:2511.06209v4 Announce Type: replace Abstract: LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve LLM performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and strategically choosing the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally […]
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
arXiv:2604.20443v1 Announce Type: cross Abstract: Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal […]
Transparent Screening for LLM Inference and Training Impacts
arXiv:2604.19757v1 Announce Type: cross Abstract: This paper presents a transparent screening framework for estimating inference and training impacts of current large language models under limited observability. The framework converts natural-language application descriptions into bounded environmental estimates and supports a comparative online observatory of current market models. Rather than claiming direct measurement for opaque proprietary services, […]
PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning
arXiv:2604.12652v2 Announce Type: replace-cross Abstract: Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires emphno annotation […]
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
arXiv:2604.20140v1 Announce Type: new Abstract: Direct Preference Optimization (DPO) is an effective framework for aligning large language models with human preferences, but it struggles with complex reasoning tasks. DPO optimizes for the likelihood of generating preferred over dispreferred responses in their entirety and lacks the granularity to provide feedback on subsections of many-step solutions typical […]
CHASM: Unveiling Covert Advertisements on Chinese Social Media
arXiv:2604.20511v1 Announce Type: cross Abstract: Current benchmarks for evaluating large language models (LLMs) in social media moderation completely overlook a serious threat: covert advertisements, which disguise themselves as regular posts to deceive and mislead consumers into making purchases, leading to significant ethical and legal concerns. In this paper, we present the CHASM, a first-of-its-kind dataset […]
Improving Molecular Force Fields with Minimal Temporal Information
arXiv:2604.19806v1 Announce Type: cross Abstract: Accurate prediction of energy and forces for 3D molecular systems is one of fundamental challenges at the core of AI for Science applications. Many powerful and data-efficient neural networks predict molecular energies and forces from single atomic configurations. However, one crucial aspect of the data generation process is rarely considered […]
EgoSelf: From Memory to Personalized Egocentric Assistant
arXiv:2604.19564v2 Announce Type: replace-cross Abstract: Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we […]
Measuring Creativity in the Age of Generative AI: Distinguishing Human and AI-Generated Creative Performance in Hiring and Talent Systems
arXiv:2604.19799v1 Announce Type: cross Abstract: Generative AI is rapidly transforming how organizations create value and evaluate talent. While large language models enhance baseline output quality, they simultaneously introduce ambiguity in assessing human creativity, as observable artifacts may be partially or fully AI-generated. This paper reconceptualizes creativity as a distributional and process-based property that emerges under […]
The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?
arXiv:2604.19749v1 Announce Type: new Abstract: Equipping LLMs with external tools effectively addresses internal reasoning limitations. However, it introduces a critical yet under-explored phenomenon: tool overuse, the unnecessary tool-use during reasoning. In this paper, we first reveal this phenomenon is pervasive across diverse LLMs. We then experimentally elucidate its underlying mechanisms through two key lenses: (1) […]
Assessing the Robustness of Climate Foundation Models under No-Analog Distribution Shifts
arXiv:2603.23043v2 Announce Type: replace-cross Abstract: The accelerating pace of climate change introduces profound non-stationarities that challenge the ability of Machine Learning based climate emulators to generalize beyond their training distributions. While these emulators offer computationally efficient alternatives to traditional Earth System Models, their reliability remains a potential bottleneck under “no-analog” future climate states, which we […]