Behavior change beyond intervention: an activity-theoretical perspective on human-centered design of personal health technology

IntroductionModern personal technologies, such as smartphone apps with artificial intelligence (AI) capabilities, have a significant potential for helping people make necessary changes in their behavior

A data-centric perspective on designing AI foundation models for healthcare

Post Content

Disclosure in the era of generative artificial intelligence

Generative artificial intelligence (AI) has rapidly become embedded in academic writing, assisting with tasks ranging from language editing to drafting text and producing evidence. Despite

What will it take to achieve the End TB targets in South Africa? A mathematical modelling analysis

Background: The WHO End TB strategy targets 80% and 90% reductions in TB incidence and mortality, respectively, between 2015 and 2030. Objective: We assess which

Temporal Dynamics of Cortical State Plasticity Following Adult Vision Loss

The adult brain retains a capacity for adaptation, yet the long-term sequence of plasticity remains poorly understood. To address this, we performed longitudinal mesoscopic calcium

Learning Reasoning Reward Models from Expert Demonstration via Inverse Reinforcement Learning

April 24, 2026

arXiv:2510.01857v3 Announce Type: replace
Abstract: Current approaches to improving reasoning in large language models (LLMs) primarily rely on either supervised fine-tuning (SFT) over expert traces or reinforcement learning (RL) with outcome-level rewards. However, SFT is fundamentally imitative, while outcome-based RL assumes access to a well-specified verifier. To address this gap, we propose an adversarial inverse reinforcement learning (AIRL) framework that learns reasoning rewards directly from expert demonstrations. We evaluate this framework across reward granularities (sparse, interval, and dense). Granularity controls the resolution of credit assignment: sparse rewards emphasise global trajectory quality and training stability, while denser rewards provide higher-resolution step-level supervision for error localisation but are harder to optimise stably. We show that the learned reasoning rewards are useful in three complementary ways. First, as a training signal, they often outperform SFT, with the best variant improving over SFT on medical reasoning (MedReason), mathematics (GSM8K), and challenging scientific question-answering (MMLU-Pro). Second, as an inference-time reranker, they gain up to 17.4 percentage points under a fixed sampling budget. Third, the learned reward transfers across tasks and backbones, suggesting that part of the signal is reusable beyond a single domain or model, and that finer-grained rewards identify the first step at which a trajectory deviates from a correct path. This supports the diagnosis of reasoning failures and the improvement of test-time selection. Together, these results show that AIRL can recover a reusable intermediate reasoning step from demonstrations alone, bridging the gap between pure imitation and reward-driven optimisation for LLM reasoning.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844