Learning to lead in a hybrid human-AI enterprise

Learning to lead in a hybrid human-AI enterprise

As adoption of AI agents looks set to surge by as much as 300% in the next two years, leadership teams are carefully considering the implications

David Sinclair plans to test whole-body rejuvenation drugs in the XPrize competition

The outspoken longevity scientist David Sinclair has been predicting that one day, you’ll go to the doctor and get a prescription that will make you

Five things you need to know about AI

At SXSW London last week I gave a talk called “Five things you need to know about AI,” in which I shared what I think

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

arXiv:2606.07618v1 Announce Type: cross Abstract: NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However,

ViMax: Agentic Video Generation

arXiv:2606.07649v1 Announce Type: cross Abstract: Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated

CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

June 9, 2026

arXiv:2509.25004v2 Announce Type: replace
Abstract: Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of large language models, but most methods still optimize reasoning trajectories over the static problem set, wasting rollout budget on solved or overly difficult problems. We propose textbfCLPO (Curriculum Learning meets Policy Optimization), a self-evolving curriculum framework that uses on-policy rollout accuracy to identify solved, medium-difficulty, and hard problems, then restructures selected tasks according to the model’s current capability. Hard problems are simplified to become learnable, while medium-difficulty problems are diversified to provide useful training variation. This allows the learning curriculum to co-evolve with the policy rather than remaining fixed as the model’s capability boundary shifts. Rather than treating these rewrites as static data augmentation, CLPO optimizes restructuring trajectories with credit assigned by the downstream accuracy gain of the rewritten problem, requiring no additional human annotations beyond the original verifiable answers. Experiments across mathematical reasoning and out-of-domain general reasoning benchmarks show that CLPO substantially outperforms GRPO and DAPO on Qwen3-8B by 10.21 and 7.75 average points, respectively. Ablation studies on math and code domains further show that both the restructuring mode and the rewriting loss contribute to the final gains, demonstrating that CLPO provides a scalable and robust pathway for eliciting stronger reasoning capabilities through a self-evolving curriculum.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844