Wavelet analysis of human recombination rates demonstrates divergence on fine scales

Background: Recombination rates can be estimated across the genome, underpinning genetic analyses such as identification of regions under selection. Accurate recombination mapping requires observing a

Cortical geometry constrains the unimodal anchors of sensory integration

The human cerebral cortex is organized along a unimodal-to-transmodal hierarchy, which provides a putative substrate for the integration of sensory signals from primary cortical fields

Post-translational modifications in the brain are critical contributors to Alzheimers disease neuropathology and cognitive decline

Post-translational modifications (PTMs) in APP and MAPT contribute to plaques and tangles in Alzheimers disease (AD). Yet broader proteome-wide PTMs in the AD brain are

The Amygdalostriatal Transition Area Exhibits Lateral Amygdala-Like Spiking Activity and Tone-Shock Pairing-Induced Plasticity

During Pavlovian fear conditioning, presentation of a conditioned stimulus, such as a tone, together with an unconditioned stimulus, such as an electrical shock, excites neurons

The Logic of Thalamic Inputs onto the Molecular Taxonomy of Cortical Neurons Reveals a Visual Hierarchy

The hierarchical organization of sensory cortices and the rich molecular taxonomy of their cell types are defining features of the mammalian cortex. Cortical areas along

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

June 8, 2026

arXiv:2606.06712v1 Announce Type: cross
Abstract: We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

Subscribe for Updates

Copyright 2025 dijee Intelligence Ltd. dijee Intelligence Ltd. is a private limited company registered in England and Wales at Media House, Sopers Road, Cuffley, Hertfordshire, EN6 4RY, UK registration number 16808844