Measuring and Exploiting Confirmation Bias in LLM-Assisted Security Code Review

arXiv:2603.18740v1 Announce Type: cross Abstract: Security code reviews increasingly rely on systems integrating Large Language Models (LLMs), ranging from interactive assistants to autonomous agents in

Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

arXiv:2603.18911v1 Announce Type: cross Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches

Page image classification for content-specific data processing

arXiv:2507.21114v2 Announce Type: replace-cross Abstract: Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting

Sheaf Neural Networks and biomedical applications

arXiv:2602.00159v2 Announce Type: replace-cross Abstract: The purpose of this paper is to elucidate the theory and mathematical modelling behind the sheaf neural network (SNN) algorithm

How Uncertainty Estimation Scales with Sampling in Reasoning Models

arXiv:2603.19118v1 Announce Type: new Abstract: Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel

EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing

March 19, 2026

arXiv:2509.13399v3 Announce Type: replace-cross
Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, resulting in limited coverage and inheriting biases from prior generative models or (ii) rely solely on zero-shot vision language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object centric metrics tailored for multi turn evaluation and one global metric of visual quality: 1) EdiVal-IF, which measures instruction following by combining open vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector guided crops; 2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and 3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal Bench, a multi-turn editing benchmark covering 9 instruction types and 16 state-of-the-art editing models, spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844