arXiv:2605.27914v1 Announce Type: cross
Abstract: Subjective evaluation of LLM behavior — empathy, restraint, calibrated emotional tone — is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target’s training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree.
We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties — reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric self-evolve data-driven across iterations: the dimensions are not pre-stipulated and the procedure stabilizes to a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected.
Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint — whether a model refrains from giving unsolicited solutions in empathic contexts — gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).
Unburdening healthcare systems through telenursing in chronic respiratory disease management: a systematic review
Background/objectivesChronic respiratory diseases represent a major cause of morbidity/mortality and healthcare expenditure due to disease exacerbations, emergency department (ED) presentations, hospitalizations, and length of stay