• Home
  • Uncategorized
  • Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling

arXiv:2509.16343v2 Announce Type: replace-cross
Abstract: Building robust vision systems for high-stakes domains such as remote sensing requires stronger visual reasoning than what single-pass inference typically provides; yet, retraining large models is often computationally expensive and data intensive. We present Visual Reasoning Agent (VRA), a training-free agentic visual reasoning framework that orchestrates off-the-shelf large vision-language models (LVLMs) with a large reasoning model (LRM) through an iterative Think-Critique-Act loop for cross-model verification, self-critique, and recursive refinement. On the remote sensing benchmark VRSBench VQA dataset, VRA consistently outperforms multiple standalone LVLM baselines and achieves up to 40.67% improvement on challenging question types spanning both perception and reasoning tasks. In addition, integrating three LVLMs with VRA improves the overall accuracy of the standalone LVLMs from 52.8% to 78.8%, demonstrating the effectiveness of agentic reasoning with increased inference-time compute.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd.   dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844