arXiv:2604.10701v1 Announce Type: cross Abstract: Credit assignment is a central challenge in reinforcement learning (RL). Classical actor-critic methods address this challenge through fine-grained advantage estimation based on a learned value function. However, learned value models are often avoided in modern large language model (LLM) RL because conventional discriminative critics are difficult to train reliably. We […]
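As a concrete illustration of the fine-grained advantage estimation this abstract alludes to, here is a minimal sketch of generalized advantage estimation (GAE) from a learned value function. It is a standard construction rather than this paper's method; all variable names and numbers are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    # `values` carries one extra entry: the bootstrap value of the final state.
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Example: a 4-step trajectory with a learned critic's value estimates.
rewards = np.array([0.0, 0.0, 0.0, 1.0])
values  = np.array([0.1, 0.2, 0.4, 0.7, 0.0])  # V(s_0..s_3) plus bootstrap V(s_4)
print(gae_advantages(rewards, values))
```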
Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models
arXiv:2604.10893v1 Announce Type: cross Abstract: Watermarking provides a critical safeguard for large language model (LLM) services by facilitating the detection of LLM-generated text. In response, stealing watermark algorithms (SWAs) derive watermark information from watermarked texts generated by victim LLMs to craft highly targeted adversarial attacks, which compromise the reliability of watermarks. Existing SWAs rely on fixed […]
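For context on what a stealing attack targets, below is a minimal sketch of a common green-list watermark detector in the style of Kirchenbauer et al. This is an assumed, simplified scheme used only for illustration, not the adaptive watermark this paper proposes; all function names are hypothetical.

```python
import hashlib
import math

def green_list(prev_token: int, vocab_size: int, frac: float = 0.5) -> set:
    # Pseudo-randomly partition the vocabulary, keyed on the previous token.
    # A fixed rule like this is the "seal" an SWA could recover from samples.
    return {
        tok for tok in range(vocab_size)
        if hashlib.sha256(f"{prev_token}:{tok}".encode()).digest()[0] / 255.0 < frac
    }

def detection_z_score(tokens: list, vocab_size: int, frac: float = 0.5) -> float:
    # z-score of the green-token count against the unwatermarked null hypothesis.
    hits = sum(
        1 for prev, cur in zip(tokens, tokens[1:])
        if cur in green_list(prev, vocab_size, frac)
    )
    n = len(tokens) - 1
    return (hits - frac * n) / math.sqrt(n * frac * (1 - frac))

print(detection_z_score([5, 17, 3, 42, 8, 23], vocab_size=100))  # toy token ids
```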
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
arXiv:2604.09554v1 Announce Type: new Abstract: Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. The need to measure progress of AI systems in scientific domains correspondingly must not […]
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
arXiv:2603.24329v2 Announce Type: replace-cross Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not […]
Your Model Diversity, Not Method, Determines Reasoning Strategy
arXiv:2604.10827v1 Announce Type: new Abstract: Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches (breadth) and refining promising solutions (depth). Most methods implicitly trade off one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue […]
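A minimal sketch of the breadth/depth budget split the abstract describes, assuming the unit of budget is a single model call; the split rule and numbers are illustrative, not the paper's allocation scheme.

```python
def split_budget(total_calls: int, breadth_frac: float) -> tuple:
    # Trade parallel candidate solutions (breadth) against sequential
    # refinement passes per candidate (depth) under a fixed call budget:
    # candidates * passes <= total_calls.
    candidates = max(1, round(total_calls * breadth_frac))
    passes = max(1, total_calls // candidates)
    return candidates, passes

for frac in (0.1, 0.5, 1.0):   # depth-heavy, balanced, breadth-heavy
    print(frac, split_budget(64, frac))
```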
A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning
arXiv:2604.10506v1 Announce Type: new Abstract: Vision-Language Models (VLMs) have made significant strides in static image understanding but continue to face critical hurdles in spatiotemporal reasoning. A major bottleneck is “multi-image reasoning hallucination”, where a massive performance drop between forward and reverse temporal queries reveals a dependence on superficial shortcuts instead of genuine causal understanding. To […]
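A sketch of how the forward/reverse asymmetry described above could be measured; the metric shape and the toy numbers are assumptions for illustration, not the paper's evaluation protocol.

```python
def order_gap(results: list) -> tuple:
    # results: (forward_correct, reverse_correct) per question. A large
    # positive gap between forward and reverse accuracy is the shortcut
    # signature the abstract describes.
    fwd = sum(f for f, _ in results) / len(results)
    rev = sum(r for _, r in results) / len(results)
    return fwd, rev, fwd - rev

# Toy outcomes on four temporal questions, each queried in both orders.
print(order_gap([(True, False), (True, True), (True, False), (False, False)]))
```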
Working Paper: Towards Schema-based Learning from a Category-Theoretic Perspective
arXiv:2604.10589v1 Announce Type: new Abstract: We introduce a hierarchical categorical framework for Schema-Based Learning (SBL) structured across four interconnected levels. At the schema level, a free multicategory $\mathrm{Sch}_{syn}$ encodes fundamental schemas and transformations. An implementation functor $\mathcal{I}$ maps syntactic schemas to representational languages, inducing via the Grothendieck construction the total category $\mathrm{Sch}_{impl}$. Implemented schemas are […]
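For readers unfamiliar with the construction the abstract invokes, the following is the standard (covariant) Grothendieck construction for a functor $\mathcal{I} : \mathrm{Sch}_{syn} \to \mathbf{Cat}$; the paper's precise multicategorical definitions may differ.

```latex
% Objects of the total category pair a schema with an implementation:
\[
  \mathrm{Ob}\Bigl(\textstyle\int \mathcal{I}\Bigr)
    = \bigl\{ (S, x) \;\bigm|\; S \in \mathrm{Sch}_{syn},\ x \in \mathcal{I}(S) \bigr\},
\]
% and morphisms pair a schema map with a compatible implementation map:
\[
  \mathrm{Hom}\bigl((S, x), (S', x')\bigr)
    = \bigl\{ (f, \varphi) \;\bigm|\; f : S \to S',\ \varphi : \mathcal{I}(f)(x) \to x' \bigr\},
\]
% so that $\mathrm{Sch}_{impl} = \int \mathcal{I}$ projects back onto $\mathrm{Sch}_{syn}$.
```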
Inspectable AI for Science: A Research Object Approach to Generative AI Governance
arXiv:2604.11261v1 Announce Type: new Abstract: This paper introduces AI as a Research Object (AI-RO), a paradigm for governing the use of generative AI in scientific research. Instead of debating whether AI is an author or merely a tool, we propose treating AI interactions as structured, inspectable components of the research process. Under this view, the […]
Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
arXiv:2604.08963v2 Announce Type: replace-cross Abstract: While Multi-Agent Systems (MAS) are increasingly deployed for complex workflows, their emergent properties, particularly the accumulation of bias, remain poorly understood. Because real-world MAS are too complex to analyze entirely, evaluating their ethical robustness requires first isolating their foundational mechanics. In this work, we conduct a baseline empirical study investigating how basic […]
TInR: Exploring Tool-Internalized Reasoning in Large Language Models
arXiv:2604.10788v1 Announce Type: cross Abstract: Tool-Integrated Reasoning (TIR) has emerged as a promising direction that extends the capabilities of Large Language Models (LLMs) with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning, which leads to difficulty in tool mastery, constraints on tool-set size, and inference inefficiency. To mitigate these issues, we […]
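A minimal sketch of the documentation-driven TIR loop that the abstract contrasts against: the model sees external tool documentation, may emit a structured call, and the result is fed back into its context. All interfaces here (model.generate, the CALL convention) are hypothetical stand-ins, not this paper's protocol.

```python
import json

def tool_integrated_step(model, tools: dict, prompt: str) -> str:
    # Expose external tool documentation in the prompt, as typical TIR does.
    docs = "\n".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())
    reply = model.generate(f"Tools available:\n{docs}\n\nTask: {prompt}")
    # Hypothetical convention: 'CALL {"tool": "calc", "arg": "2+2"}'
    if reply.startswith("CALL "):
        call = json.loads(reply[len("CALL "):])
        result = tools[call["tool"]](call["arg"])
        # Feed the execution result back for a final answer.
        reply = model.generate(f"{prompt}\nTool result: {result}\nFinal answer:")
    return reply
```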
Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
arXiv:2604.10799v1 Announce Type: cross Abstract: The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal […]
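The inefficiency described above is commonly quantified as tokenizer fertility (tokens per word): a universal tokenizer fragments Polish words into many subwords. A minimal sketch assuming Hugging Face tokenizers; the model name is an example, not the Bielik training setup.

```python
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, text: str) -> float:
    # Tokens per whitespace-delimited word; closer to 1 is more efficient.
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_words = len(text.split())
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return n_tokens / n_words

sample = "Zażółć gęślą jaźń, a następnie opisz wyniki eksperymentu."
# A universal tokenizer typically shows high fertility on Polish text.
print(fertility("gpt2", sample))
```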
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
arXiv:2604.10800v1 Announce Type: cross Abstract: Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language […]
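A minimal sketch of the verify-before-fix gating idea in the title, under the assumption that each flagged finding can be replayed through an executable check before any action is taken; all interfaces are hypothetical.

```python
def verify_before_fix(candidates, run_check, apply_fix):
    # A classifier's flagged findings are probabilistic inferences; act only
    # on those an executable check (e.g. a PoC test or sanitizer run) confirms.
    confirmed, discarded = [], []
    for finding in candidates:
        if run_check(finding):
            apply_fix(finding)
            confirmed.append(finding)
        else:
            discarded.append(finding)  # unverified prediction: do not act
    return confirmed, discarded
```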