Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

arXiv:2604.04532v1 Announce Type: cross
Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, $p<0.001$ vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss’ $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
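
For readers unfamiliar with the agreement statistic quoted in the abstract, here is a minimal Python sketch of the standard Fleiss' $\kappa$ formula for a fixed number of raters per item. It is not the authors' code: the function name `fleiss_kappa`, the labels `"sat"`/`"unsat"`, and the toy judgments are all illustrative assumptions, with six raters chosen only to mirror the six judge backbones.

```python
from collections import Counter
from typing import Hashable, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fleiss' kappa; ratings[i] holds the labels all raters gave to item i."""
    if not ratings:
        raise ValueError("need at least one rated item")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    n_items = len(ratings)
    category_totals: Counter = Counter()
    observed = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Per-item agreement: share of rater pairs that chose the same label.
        same_pairs = sum(c * c for c in counts.values()) - n_raters
        observed += same_pairs / (n_raters * (n_raters - 1))
    observed /= n_items

    # Chance agreement from the overall label proportions.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: 4 requirements, each judged by 6 raters.
toy_judgments = [
    ["sat"] * 6,
    ["sat", "sat", "sat", "unsat", "unsat", "unsat"],
    ["unsat"] * 6,
    ["sat", "sat", "unsat", "unsat", "unsat", "unsat"],
]
print(f"Fleiss' kappa on toy data: {fleiss_kappa(toy_judgments):.3f}")
```

A value of 1 means the raters always agree, values near 0 mean agreement is about what chance would produce, and negative values mean agreement is below chance. On that scale, the reported upper bound of 0.231 indicates that the judge backbones often disagree on individual requirements.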

