arXiv:2502.16944v2 Announce Type: replace-cross
Abstract: In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision. The value function predicts the return-to-go of a partial answer, that is, how promising the partial answer is if continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected. This makes critic learning redundant: training a reward model and then deriving a value model from it is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision; our value model is trained on exactly the same data used for reward modeling. Building on this insight, we introduce Decoupled Value Policy Optimization (DVPO), a framework that pretrains a Global Value Model (GVM) offline and freezes it as a universal critic for policy learning. The GVM provides stable, fine-grained credit assignment without critic drift or trajectory sampling. Experiments across MT-Bench, Alpaca-Eval, and Arena-Hard demonstrate that DVPO matches or surpasses state-of-the-art RLHF methods. These results highlight that RLHF can be reframed as policy-only optimization guided by a single pretrained value model.
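To make the policy-only training loop concrete, here is a minimal sketch (not the authors' code) of one update step that uses a frozen, pretrained value model as the critic, in the spirit of DVPO. The names (`policy`, `gvm`, `response_mask`, `log_probs_old`), the token-level value interface, and the value-difference advantage proxy are illustrative assumptions layered on a standard PPO-style clipped objective.

```python
# Hypothetical sketch: policy-only optimization with a frozen Global Value Model (GVM).
# Assumes `gvm(input_ids)` returns a per-token return-to-go estimate of each prefix
# and `policy(input_ids).logits` is a standard causal-LM output; neither is taken
# from the paper's actual implementation.
import torch

def dvpo_policy_step(policy, gvm, optimizer, batch, clip_eps=0.2):
    """One clipped policy-gradient update; only the policy is trained (no critic loss)."""
    input_ids = batch["input_ids"]                   # (B, T) prompt + response tokens
    targets = input_ids[:, 1:]                       # tokens the policy must reproduce
    mask = batch["response_mask"][:, 1:].float()     # 1 on generated (response) tokens

    # Frozen critic: per-prefix value estimates give fine-grained credit assignment.
    with torch.no_grad():
        values = gvm(input_ids)                      # (B, T) return-to-go per prefix
        # Proxy advantage for token t: value after adding it minus value before
        # (a TD-style difference with zero intermediate reward; other estimators
        # such as GAE could be substituted here).
        advantages = values[:, 1:] - values[:, :-1]  # (B, T-1)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Current policy log-probs for the same tokens (logits at t predict token t+1).
    logits = policy(input_ids).logits[:, :-1]        # (B, T-1, V)
    log_probs = torch.log_softmax(logits, dim=-1)
    log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T-1)

    # PPO-style clipped surrogate; `log_probs_old` are behavior-policy log-probs
    # aligned to the same target positions.
    ratio = torch.exp(log_probs - batch["log_probs_old"])
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the value model is pretrained offline and frozen, the inner loop above has no critic update and no value-target bootstrapping, which is the source of the stability the abstract describes; how the GVM itself is pretrained from preference data is left to the paper.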


