Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification

March 10, 2026

arXiv:2507.11662v3
Abstract: Verifiers, functions that assign rewards to agent behavior, have been key to AI progress in math, code, and games. However, extending these gains to domains without clear-cut success criteria remains a challenge: while humans can recognize desired outcomes, translating this intuition into scalable rules is nontrivial. Multimodal LLMs (MLLMs) offer a promising solution, given their world knowledge, human-preference alignment, and reasoning capabilities. We evaluate MLLM verifiers across web navigation, computer use, and robotics, spanning 13+ models, 28+ designs, and thousands of trajectories from diverse agents. We identify a critical limitation: a strong tendency for MLLMs to over-validate agent behavior, a phenomenon we term agreement bias. This bias is pervasive, resilient to test-time scaling, and can harm applications relying on MLLM judgments or rewards (e.g., self-improvement, steering, online supervision). We discuss several considerations for evaluating and designing MLLM verifiers, and introduce SGV (Self-Grounded Verification), a lightweight method that better leverages their capabilities by modulating (un)conditional generation. First, an MLLM is prompted to generate broad priors about desired behavior, independent of the data under evaluation. Then, conditioned on these self-generated priors, it reasons over and evaluates a candidate trajectory. Our methods yield more human-aligned verifiers, improving failure detection by 25pp and accuracy by 14pp. In self-improvement and online supervision, they boost task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena, surpassing the previous state of the art by 20pp. As a byproduct, we release an update of VisualWebArena featuring strong agent baselines, more human-aligned oracles, container parallelism with high fidelity and proper resets, >10x speedups, and VWA-Lite, a 1/3 subset with comparable evaluation fidelity.
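The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` callable, the prompt wording, the function names, and the `RecordingStub` stand-in are all hypothetical, and trajectories are reduced to text steps even though the paper works with multimodal inputs. The point it shows is the separation of calls: the first call never sees the trajectory, and the second call is conditioned on the output of the first.

```python
from typing import Callable, List, Sequence

# Hypothetical model interface: takes a text prompt, returns a text completion.
Model = Callable[[str], str]


def generate_priors(model: Model, task: str) -> str:
    """Step 1: elicit broad priors about desired behavior.

    The trajectory under evaluation is deliberately NOT part of this prompt.
    """
    prompt = (
        f"Task: {task}\n"
        "Describe what a successful completion of this task generally looks like."
    )
    return model(prompt)


def verify(model: Model, task: str, trajectory: Sequence[str], priors: str) -> bool:
    """Step 2: evaluate the trajectory, conditioned on the self-generated priors."""
    steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(trajectory))
    prompt = (
        f"Task: {task}\n"
        f"Expected behavior:\n{priors}\n"
        f"Trajectory:\n{steps}\n"
        "Does the trajectory accomplish the task? Answer SUCCESS or FAILURE."
    )
    answer = model(prompt)
    return answer.strip().upper().startswith("SUCCESS")


def two_step_verify(model: Model, task: str, trajectory: Sequence[str]) -> bool:
    """Run both steps with the same model (assumed here; a sketch only)."""
    priors = generate_priors(model, task)
    return verify(model, task, trajectory, priors)


class RecordingStub:
    """Toy stand-in for an MLLM that records prompts and returns canned text."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if "Trajectory:" in prompt:
            return "FAILURE"  # canned verdict for the evaluation call
        return "The requested item should appear in the cart."
```

With a real model in place of `RecordingStub`, `two_step_verify(model, task, trajectory)` would return `True` only when the second call answers SUCCESS. Whether a given prompt format actually reduces agreement bias is an empirical question that this sketch does not address.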


Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844