arXiv:2604.04532v1 Announce Type: cross
Abstract: Evaluation language is typically treated as a fixed English default in agentic code benchmarks, yet we show that changing the judge’s language can invert backbone rankings. We localize the Agent-as-a-Judge prompt stack to five typologically diverse languages (English, Arabic, Turkish, Chinese, Hindi) and evaluate 55 DevAI development tasks across three developer-agent frameworks and six judge backbones, totaling 4950 judge runs. The central finding is that backbone and language interact: GPT-4o achieves the highest satisfaction in English (44.72%), while Gemini leads in Arabic (51.72%, $p<0.001$ vs. GPT-4o) and Hindi (53.22%). No single backbone dominates across all languages, and inter-backbone agreement on individual requirement judgments is modest (Fleiss’ $\kappa \leq 0.231$). A controlled ablation further shows that localizing judge-side instructions, not just benchmark content, can be decisive: Hindi satisfaction drops from 42.8% to 23.2% under partial localization. These results indicate that language should be treated as an explicit evaluation variable in agentic benchmarks. Full requirement-level judgments and runtime statistics are released for reproducibility.
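For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of the standard Fleiss' kappa formula. It assumes each requirement is rated by the same number of judges and that each verdict falls into one of a fixed set of categories (for example, satisfied or unsatisfied). The function name `fleiss_kappa` and the toy counts are illustrative assumptions, not data or code from the paper.

```python
import math


def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    if math.isclose(p_chance, 1.0):
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy table: 4 requirements, 6 judges, [satisfied, unsatisfied].
toy_counts = [[6, 0], [0, 6], [3, 3], [4, 2]]
kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate near-perfect agreement and values near 0 indicate agreement no better than chance, which is why a ceiling of 0.231 is described as modest.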
Dysregulation of Hippo Signaling Pathway as a Convergent Mechanism Underlying Choroid Plexus Defects in Bipolar Disorder
Bipolar disorder (BD) is a prevalent and highly heritable psychiatric condition. Developmental mechanisms are implicated but the specific molecular origins remain unclear. The choroid plexus

