arXiv:2603.14128v1 Announce Type: cross
Abstract: Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors, such as fine-grained prompt fidelity, compositional correctness, and text rendering, are only weakly specified by score- or flow-matching pretraining objectives. Reinforcement learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle: trajectory-based methods incur high memory cost and high-variance gradient estimates, while forward-process approaches converge faster but can suffer from distribution drift and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization and built on forward-process fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pretrained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
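
To make the within-prompt centering step and the reward-adaptive KL idea concrete, below is a minimal PyTorch sketch. Everything here is an illustrative assumption: the function names, tensor shapes, and the specific KL schedule are ours, not the paper's. The abstract only states that rewards are centered within each prompt group (so any prompt-dependent constant, such as the intractable log normalizer, cancels) and that the KL strength adapts with reward.

    # Hypothetical sketch of within-prompt reward centering and a
    # reward-adaptive KL weight, assumed from the abstract's description.
    import torch

    def center_rewards_within_prompt(rewards: torch.Tensor) -> torch.Tensor:
        """Center black-box rewards within each prompt group.

        rewards: tensor of shape (num_prompts, samples_per_prompt), where each
        row holds the rewards of several images sampled for one prompt.
        Returns a same-shape tensor with zero mean along each row, so any
        additive term that depends only on the prompt (e.g. a log normalizing
        constant) contributes nothing to the matching objective.
        """
        return rewards - rewards.mean(dim=1, keepdim=True)

    def reward_adaptive_kl_weight(mean_reward: float,
                                  beta_base: float = 0.1,
                                  floor: float = 0.1) -> float:
        """Illustrative reward-adaptive KL strength (our assumption, not the
        paper's schedule): scale a large base KL coefficient down when the
        average reward is still low, so early learning is fast, and restore
        full strength as reward rises, curbing late-stage reward hacking.
        mean_reward is assumed to be normalized to [0, 1]."""
        t = min(max(mean_reward, 0.0), 1.0)
        return beta_base * (floor + (1.0 - floor) * t)

The per-row centering is what makes the objective well posed: since every sample in a row shares the same prompt, subtracting the row mean removes exactly the prompt-dependent constants while preserving the reward differences that carry the learning signal.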

