arXiv:2605.01733v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) excel at grounded reasoning but remain prone to object hallucination. Recent work treats self-generated captions as a uniformly positive resource, yet we find that naively embedding one can degrade rather than help, dropping Qwen2.5-VL-3B accuracy on HallusionBench by nearly 10 points. Two structural properties explain this. First, captions […]
Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation
arXiv:2604.28031v2 Announce Type: replace-cross Abstract: When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and […]
Research on Vision-Language Question Answering Models for Industrial Robots
arXiv:2605.01483v1 Announce Type: cross Abstract: A hierarchical cross-modal fusion model is proposed for vision-language question answering (VLQA) in industrial robotics, targeting the challenges of semantic ambiguity, complex environmental layouts, and domain-specific terminology common in modern manufacturing. The framework integrates advanced object detection, multi-scale visual encoding, syntactic parsing, and task-aware semantic attention to unite vision and […]
Quality-Aware Exploration Budget Allocation for Cooperative Multi-Agent Reinforcement Learning
arXiv:2605.01865v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) requires agents to discover joint strategies in a combinatorially large state-action space, yet effective coordination configurations are exceedingly rare. Intrinsic motivation, which augments task rewards with novelty bonuses, is a popular approach for driving exploration, but its effectiveness hinges on the exploration intensity $\beta$, where […]
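The intrinsic-motivation setup this abstract refers to is standard: the novelty bonus is scaled by the exploration intensity $\beta$ before being added to the task reward. A minimal sketch of that generic formulation (not the paper's allocation scheme; the count-based bonus is one common choice, assumed here for illustration):

```python
import math

def shaped_reward(task_reward: float, novelty_bonus: float, beta: float) -> float:
    """Intrinsic-motivation shaping: scale the novelty bonus by the
    exploration intensity beta and add it to the extrinsic task reward."""
    return task_reward + beta * novelty_bonus

def count_based_novelty(visit_count: int) -> float:
    """A common count-based bonus, 1 / sqrt(N(s)), which decays as the
    same state is revisited."""
    return 1.0 / math.sqrt(max(visit_count, 1))
```

With $\beta$ too small exploration stalls; too large, and the bonus drowns the task reward, which is the sensitivity the paper's budget allocation targets.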
Retrieval-Guided Generation for Safer Histopathology Image Captioning
arXiv:2605.00893v1 Announce Type: cross Abstract: Generative vision-language models can produce fluent medical image captions but remain prone to hallucination, over-specific diagnostic claims, and factual inconsistency, all serious issues in pathology. We investigate retrieval-guided generation (RGG) as a safer alternative, where captions are formed by summarizing expert text from visually similar cases rather than generated de novo. On […]
EventADL: Open-Box Anomaly Detection and Localization Framework for Events in Cloud-Based Service Systems
arXiv:2605.00936v1 Announce Type: cross Abstract: Anomaly detection and localization (ADL) is critical for maintaining reliability and availability in cloud systems. Recent ADL developments focus on metric and log data, leaving event data unexplored. To address this gap, we propose EventADL, the first open-box event-based ADL framework for cloud-based service systems. To motivate the design of […]
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
arXiv:2605.00994v1 Announce Type: cross Abstract: Finetuning can significantly modify the behavior of large language models, including introducing harmful or unsafe behaviors. To study these risks, researchers develop model organisms: models finetuned to exhibit specific known behaviors for controlled experimentation. Identifying these behaviors remains challenging. We show that a simple perplexity-based method can surface finetuning objectives […]
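Perplexity differencing, as the abstract describes it, is simple to state: score the same text under both the base and the finetuned model, and a large drop in perplexity under the finetuned model flags text aligned with the finetuning objective. A minimal sketch operating on per-token log-probabilities (the exact scoring procedure in the paper may differ):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_gap(base_logprobs, finetuned_logprobs):
    """Score one text under both models. A large positive gap means the
    finetuned model finds the text much more likely than the base model,
    surfacing it as a candidate finetuning objective."""
    return perplexity(base_logprobs) - perplexity(finetuned_logprobs)
```

Ranking a probe set of texts by this gap surfaces the behaviors the finetuning introduced, which is why the abstract calls model organisms "leaky".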
Multi-Perspective Transformers in ARC-AGI-2 Challenge
arXiv:2605.01154v1 Announce Type: cross Abstract: ARC-AGI-2 is a benchmark of human-intuitive visual puzzles that measures a machine’s ability to generalize from limited examples, interpret symbolic meaning, and flexibly apply rules in varying contexts. In this paper, we discuss our approach to solving the ARC-AGI-2 puzzles with TinyLM, with additional fine-tuning at test time, including Test-Time-Training […]
ABox Abduction for Inconsistent Knowledge Bases under Repair Semantics
arXiv:2605.01341v1 Announce Type: cross Abstract: Given a knowledge base (KB) with a non-entailed fact, the ABox abduction problem asks for possible extensions of the KB that would entail this fact. This problem has many applications, ranging from diagnosis to explainability and repair. ABox abduction has been well-investigated for consistent KBs and classical semantics, but little […]
HepScript: A Dual-Use DSL for Human-AI Collaborative Data Analysis Workflows in High-Energy Physics
arXiv:2605.01423v1 Announce Type: cross Abstract: The escalating data scale in High-Energy Physics (HEP) fuels a growing aspiration for higher analytical efficiency. While Large Language Models (LLMs) offer a path toward automation via agentic AI, they struggle with complex scientific workflows that require deep domain knowledge and are tightly coupled to experiment-specific codebases. To address this, […]
Model Merging: Foundations and Algorithms
arXiv:2605.01580v1 Announce Type: cross Abstract: Modern deep learning usually treats models as separate artifacts: trained independently, specialized for particular purposes, and replaced when improved versions appear. This thesis studies model merging as an alternative paradigm: combining independently trained neural networks directly in weight space, with little or no optimization and without requiring access to the […]
GRAVITY: Architecture-Agnostic Structured Anchoring for Long-Horizon Conversational Memory
arXiv:2605.01688v1 Announce Type: cross Abstract: Long-horizon conversational agents rely on memory systems with increasingly sophisticated retrieval mechanisms. However, retrieved fragments are typically fed to the language model as unstructured text, lacking the relational, temporal, and thematic structures essential for complex reasoning. To bridge this reasoning gap, we introduce GRAVITY (Generation-time Relational Anchoring Via Injected Topological […]