arXiv:2605.04083v1 Announce Type: cross
Abstract: Much of the focus in RL today is on evaluation design: building meaningful evals that serve simultaneously as benchmarks and as well-defined reward signals for post-training. Yet, many real-world tasks are governed by subjective, procedural, and domain-specific requirements that are difficult to encode as exact-match targets or open-ended preference judgments frequently used in RL pipelines today. In this work, we present AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals. AsymmetryZero represents each task as a stable evaluation contract that makes grading criteria explicit: what is being graded, how each criterion is judged, and how criterion-level decisions are aggregated into a task outcome. The same contract can be executed using Inspect for model-only evaluations, as well as the Harbor Framework for agentic evaluations, enabling comparable scores and shared audit artifacts across both settings. We argue that the central challenge in post-training today is the faithful encoding of expert requirements into the evaluation itself. To that end, we present a study using Harbor that holds task contracts fixed and compares a five-model frontier jury against a five-model compact jury across four frontier-class solvers (Claude Opus 4.6, GPT-5.4, Grok-4.20, Gemini-3.1-Pro). We find that criterion-level frontier-vs-compact agreement ranges from $75.9\%$ to $89.6\%$ (strict common-subset agreement: $77.8\%$ to $92.1\%$), while compact juries exhibit substantially higher internal dissent (3–2 split rate $28.7\%$–$32.4\%$) than frontier juries ($6.1\%$–$11.5\%$). Verifier traces further show that compact juries reduce per-criterion judging cost to roughly $4.2\%$–$5.6\%$ of frontier and latency to roughly $21.7\%$–$27.1\%$, even as aggregated task-level outcomes often remain comparatively stable.
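As an illustration of the two jury statistics reported above, the following minimal sketch computes criterion-level agreement between two five-model juries and each jury's 3–2 split rate. It assumes binary pass/fail votes and simple majority aggregation; the abstract does not specify either, so the data layout, the function names (`majority`, `is_split`, `compare_juries`), and the criterion ids are hypothetical and are not part of AsymmetryZero, Inspect, or Harbor. The "strict common-subset" variant is not modeled here because its definition is not given.

```python
from typing import Dict, List

Votes = Dict[str, List[bool]]  # hypothetical layout: criterion id -> five pass/fail votes


def majority(votes: List[bool]) -> bool:
    """Majority verdict of an odd-sized jury (assumed aggregation rule)."""
    return 2 * sum(votes) > len(votes)


def is_split(votes: List[bool]) -> bool:
    """True when a five-vote jury divides 3-2 in either direction."""
    return len(votes) == 5 and sum(votes) in (2, 3)


def compare_juries(frontier: Votes, compact: Votes) -> Dict[str, float]:
    """Agreement on majority verdicts, plus each jury's 3-2 split rate."""
    shared = sorted(set(frontier) & set(compact))  # criteria judged by both juries
    if not shared:
        raise ValueError("no criteria in common")
    agree = sum(majority(frontier[c]) == majority(compact[c]) for c in shared)
    return {
        "agreement": agree / len(shared),
        "frontier_split_rate": sum(is_split(frontier[c]) for c in shared) / len(shared),
        "compact_split_rate": sum(is_split(compact[c]) for c in shared) / len(shared),
    }


# Toy votes for four made-up criteria (not data from the study).
T, F = True, False
frontier_votes = {"c1": [T, T, T, T, T], "c2": [T, T, T, T, F],
                  "c3": [F, F, F, F, F], "c4": [T, T, T, F, F]}
compact_votes = {"c1": [T, T, T, F, F], "c2": [T, T, F, F, F],
                 "c3": [F, F, F, F, T], "c4": [T, T, T, T, F]}

stats = compare_juries(frontier_votes, compact_votes)
print(stats)
```

In this toy input the two juries disagree only on `c2`, and the compact jury splits 3–2 on two criteria while the frontier jury splits on one, which mirrors the qualitative pattern described in the abstract without reproducing its numbers.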
Development of reconfigurable smart medical wards using integrated components and complex features
Treating patients in hospitals requires regular monitoring to assess their health. At the same time, routine measurements are often delayed, missed, or not