arXiv:2510.00915v4 Announce Type: replace-cross
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the false-positive (FP) rate and the false-negative (FN) rate, respectively. From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a forward correction that reweights score-function terms so that the expected update aligns with the clean gradient direction and that requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO) pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
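To make the backward correction concrete, here is a minimal sketch in Python. It assumes the standard unbiased estimator for binary labels under asymmetric noise, $(\tilde{r} - \rho_0) / (1 - \rho_0 - \rho_1)$, where $\tilde{r}$ is the observed verifier reward. The abstract does not state the exact formula, so the paper's estimator may differ in form; the function names below are hypothetical and are not taken from the paper's code.

```python
def backward_corrected_reward(observed: float, rho0: float, rho1: float) -> float:
    """Surrogate reward whose expectation equals the clean reward.

    Assumed noise model (not confirmed by the abstract):
      rho0 = P(observed = 1 | clean = 0)  (false-positive rate)
      rho1 = P(observed = 0 | clean = 1)  (false-negative rate)
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        # With rho0 + rho1 >= 1 the verifier carries no usable signal.
        raise ValueError("requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_corrected_reward(clean: int, rho0: float, rho1: float) -> float:
    """Exact expectation of the surrogate over the noisy channel."""
    # Probability that the verifier outputs 1 given the clean reward.
    p_one = (1.0 - rho1) if clean == 1 else rho0
    return (
        p_one * backward_corrected_reward(1.0, rho0, rho1)
        + (1.0 - p_one) * backward_corrected_reward(0.0, rho0, rho1)
    )
```

Under this assumed model, `expected_corrected_reward(1, 0.1, 0.3)` evaluates to 1 and `expected_corrected_reward(0, 0.1, 0.3)` evaluates to 0 (up to floating-point error), which is what "unbiased surrogate reward" means here. Note that individual surrogate values fall outside $[0,1]$, so the correction trades bias for variance. The forward correction is not sketched because the abstract does not specify its weights.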
Crisis support teams’ technological openness and learning attitudes toward the AI-based virtual patient system crisis support VR
Background: Against the backdrop of escalating global humanitarian crises, innovative didactic simulations are becoming increasingly important. A promising alternative to traditional classroom-based didactics for learning psychological