Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

arXiv:2603.17295v1 Announce Type: cross Abstract: Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and

Towards Unsupervised Adversarial Document Detection in Retrieval Augmented Generation Systems

arXiv:2603.17176v1 Announce Type: cross Abstract: Retrieval augmented generation systems have become an integral part of everyday life. Whether in internet search engines, email systems, or

SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization

arXiv:2603.17219v1 Announce Type: cross Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(mathbfx)$ varies non-linearly

Dependence Fidelity and Downstream Inference Stability in Generative Models

arXiv:2603.17041v1 Announce Type: cross Abstract: Recent advances in generative AI have led to increasingly realistic synthetic data, yet evaluation criteria remain focused on marginal distribution

Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

arXiv:2603.17111v1 Announce Type: cross Abstract: Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors

Visual Prompt Discovery via Semantic Exploration

March 18, 2026

arXiv:2603.16250v1 Announce Type: cross
Abstract: LVLMs encounter significant challenges in image understanding and visual reasoning, leading to critical perception failures. Visual prompts, which incorporate image manipulation code, have shown promising potential in mitigating these issues. While emerged as a promising direction, previous methods for visual prompt generation have focused on tool selection rather than diagnosing and mitigating the root causes of LVLM perception failures. Because of the opacity and unpredictability of LVLMs, optimal visual prompts must be discovered through empirical experiments, which have relied on manual human trial-and-error.
We propose an automated semantic exploration framework for discovering task-wise visual prompts. Our approach enables diverse yet efficient exploration through agent-driven experiments, minimizing human intervention and avoiding the inefficiency of per-sample generation. We introduce a semantic exploration algorithm named SEVEX, which addresses two major challenges of visual prompt exploration: (1) the distraction caused by lengthy, low-level code and (2) the vast, unstructured search space of visual prompts. Specifically, our method leverages an abstract idea space as a search space, a novelty-guided selection algorithm, and a semantic feedback-driven ideation process to efficiently explore diverse visual prompts based on empirical results.
We evaluate SEVEX on the BlindTest and BLINK benchmarks, which are designed to assess LVLM perception. Experimental results demonstrate that SEVEX significantly outperforms baseline methods in task accuracy, inference efficiency, exploration efficiency, and exploration stability. Notably, our framework discovers sophisticated and counter-intuitive visual strategies that go beyond conventional tool usage, offering a new paradigm for enhancing LVLM perception through automated, task-wise visual prompts.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844