arXiv:2605.27082v1 Announce Type: new Abstract: Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming […]
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
arXiv:2605.27134v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile […]
Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent
arXiv:2605.27078v1 Announce Type: cross Abstract: Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented phenomena complicate this picture: in grokking, train loss falls rapidly while test performance improves abruptly only after a long delay; in epoch-wise double descent, train loss decreases monotonically while test loss […]
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
arXiv:2605.27157v1 Announce Type: new Abstract: Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their […]
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
arXiv:2605.18592v2 Announce Type: replace-cross Abstract: Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring […]
Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
arXiv:2605.27249v1 Announce Type: new Abstract: An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student’s current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is […]
E3: Issue-Level Backtesting for Automated Research Critique
arXiv:2605.27072v1 Announce Type: cross Abstract: We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak […]
2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
arXiv:2605.27338v1 Announce Type: new Abstract: ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the […]
Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders
arXiv:2605.16825v2 Announce Type: replace-cross Abstract: Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transforming the recommendation paradigm. Despite their effectiveness, we recognize that GRs are still susceptible to the long-standing issue of popularity bias that has pervaded the recommendation community. Although a few studies have attempted to extend […]
Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU
arXiv:2605.26118v1 Announce Type: cross Abstract: Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations (quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds) to every Triton kernel in their codebase. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle […]
QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
arXiv:2605.27068v1 Announce Type: cross Abstract: Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain limited to text-only interaction, making it difficult to tell whether an agent’s language […]
Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception
arXiv:2605.26136v1 Announce Type: cross Abstract: Audio deepfakes have improved rapidly, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared […]