Unburdening healthcare systems through telenursing in chronic respiratory disease management: a systematic review

Background/objectives: Chronic respiratory diseases represent a major cause of morbidity/mortality and healthcare expenditure due to disease exacerbations, emergency department (ED) presentations, hospitalizations, and length of stay

Using GPT-4 to annotate the severity of all phenotypic abnormalities within the human phenotype ontology

Introduction: The Human Phenotype Ontology (HPO) provides a unified framework cataloguing over 17,500 phenotypic abnormalities across more than 8,600 rare diseases, defining hierarchical relationships between them.

Understanding the value of virtual care technologies: development of a framework in the veterans health administration

Introduction: Healthcare systems, including the Veterans Health Administration (VHA), are facing tremendous growth in virtual care technologies that are intended to foster connections between patients, informal

Human-supervised, large language model-based clinical decision support aligned to national newborn protocols in Kenya: a pragmatic, early-stage evaluation

Introduction: Timely, protocol-adherent clinical decisions are crucial for reducing neonatal mortality in low-resource settings. Translating extensive national guidelines into bedside practice remains challenging. Objective: We developed and evaluated

A pilot study of human–AI conversational interaction and its impact on loneliness and wellbeing

Introduction: With the growing accessibility of advanced artificial intelligence (AI) chatbots, there is a need to understand their impact on users’ psychological wellbeing. This pilot study

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

May 28, 2026

arXiv:2605.27914v1 Announce Type: cross
Abstract: Subjective evaluation of LLM behavior — empathy, restraint, calibrated emotional tone — is hard. Human inter-rater agreement on such qualities saturates near rho ~ 0.45, and an LLM-as-judge proxy alone risks circularity: a judge sharing the target’s training cohort cannot independently verify it. Anchoring validity to a single human-rater consensus does not extend to capabilities where humans themselves disagree.
We propose a replication-first paradigm: instead of anchoring on one rater group, we certify the instrument via four orthogonal properties — reliability across K runs, cross-instrument replication across architecturally distinct judges, historical-footprint calibration via judges from earlier training cohorts, and pre-registered prediction. We test it on emotional accompaniment by letting the rubric self-evolve data-driven across iterations: the dimensions are not pre-stipulated and the procedure stabilizes to a 9-dimension set. Pre-registration applies to 10 falsifiable hypotheses and 11 forward predictions, committed before any test data was collected.
Applied to 49 models across 8 families, the paradigm surfaces what aggregate scores hide. On advice-restraint — whether a model refrains from giving unsolicited solutions in empathic contexts — gpt-5 falls 1.87 points from gpt-4.1 and Opus-4.7 falls 0.629 from Opus-4.6, while aggregate scores stay flat. The regression survives three user-proxy swaps (95% of magnitude), replicates across a 5-family judge stack and a 17-month cohort gap, and persists on 74 held-out real ESConv conversations (rho in [0.749, 0.850]); the instrument reaches ordinal Krippendorff alpha = 0.91. As a by-product, the paradigm acts as a saturation-source diagnostic, separating instrumental ceilings (breakable by rubric refinement) from structural ceilings (needing scenario or roster intervention).
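The abstract reports agreement as rank correlations (rho) between scoring sources. As a rough illustration only, the sketch below computes a Spearman rank correlation between two judges' per-model scores in plain Python. The function names and the score lists are hypothetical and are not taken from the paper; the authors' actual scoring pipeline is not described here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores from two judges for the same five models (made-up data).
judge_a = [7.1, 5.4, 8.0, 6.2, 4.9]
judge_b = [6.8, 5.9, 7.7, 5.5, 5.0]
print(round(spearman_rho(judge_a, judge_b), 3))
```

Rank-based agreement is a natural fit when judges use the scale differently but are expected to order models similarly; it says nothing about whether their absolute scores match.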

