arXiv:2604.18724v1 Announce Type: new Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating […]
CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
arXiv:2603.16091v2 Announce Type: replace-cross Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers […]
GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
arXiv:2508.05498v2 Announce Type: replace Abstract: Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture […]
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
arXiv:2604.19667v1 Announce Type: cross Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements […]
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
arXiv:2604.19083v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating […]
Think Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility
arXiv:2604.19113v1 Announce Type: cross Abstract: Generative answer engines expose content through selective citation rather than ranked retrieval, fundamentally altering how visibility is determined. This shift calls for new optimization methods beyond traditional search engine optimization. Existing generative engine optimization (GEO) approaches primarily rely on token-level text rewriting, offering limited interpretability and weak control over the […]
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
arXiv:2604.19281v1 Announce Type: cross Abstract: The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model’s answers match semantically, and therefore do not provide a […]
Counting Worlds Branching Time Semantics for post-hoc Bias Mitigation in generative AI
arXiv:2604.19431v1 Announce Type: cross Abstract: Generative AI systems are known to amplify biases present in their training data. While several inference-time mitigation strategies have been proposed, they remain largely empirical and lack formal guarantees. In this paper we introduce CTLF, a branching-time logic designed to reason about bias in series of generative AI outputs. CTLF […]
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
arXiv:2604.18803v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded […]
Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning
arXiv:2604.18862v1 Announce Type: cross Abstract: Bug reports, encompassing a wide range of bug types, are crucial for maintaining software quality. However, the increasing complexity and volume of bug reports make purely manual identification and assignment to the appropriate teams for resolution a significant challenge, as dealing with all the reports is time-consuming and resource-intensive. […]
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
arXiv:2604.18914v1 Announce Type: cross Abstract: While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, […]
AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
arXiv:2604.18993v1 Announce Type: cross Abstract: Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous […]