- The paper presents d-OPSD, an on-policy self-distillation method that leverages suffix conditioning to match the bidirectional nature of diffusion LLMs.
- It introduces step-level KL divergence supervision over top-k token updates, aligning training objectives with the inherent diffusion process.
- Experimental results demonstrate that d-OPSD outperforms SFT and RLVR in reasoning tasks, achieving higher accuracy with significantly fewer optimization steps.
On-Policy Self-Distillation for Diffusion LLMs: The d-OPSD Framework
Introduction
Diffusion-based LLMs (dLLMs) present a compelling alternative to traditional autoregressive (AR) LLMs by leveraging iterative denoising for sequence generation. Unlike AR models with their strict left-to-right conditional dependencies, dLLMs enable arbitrary-order generation and offer significant speed-ups at inference through block-wise and parallel decoding strategies. Recent studies have focused on improving dLLM reasoning capabilities via reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT); however, on-policy distillation techniques, although highly effective in AR models, have remained largely unexplored for dLLMs. This work addresses that gap by introducing a tailored on-policy self-distillation (OPSD) scheme, called d-OPSD, specifically aligned with the generation dynamics and bidirectional conditioning capabilities of dLLMs (2606.18195).
Methodological Innovations
Revisiting On-Policy Distillation for dLLMs
OPSD in AR models typically appends privileged information as a left-side prefix, with dense token-level KL divergence supervision. However, this approach is incompatible with dLLMs due to their non-sequential denoising trajectory and capacity for suffix conditioning. The d-OPSD framework introduces two orthogonal innovations tailored for dLLMs:
- Self-Teacher Construction via Suffix Conditioning: dLLMs allow exposed tokens anywhere in the sequence and can condition on both prefix and suffix contexts. In d-OPSD, the model's own future-generated answers are revealed as a suffix, serving as privileged information for the self-teacher. This leverages the model's capacity for p(prefix | suffix) conditioning and fully exploits on-policy rollouts without data augmentation. Notably, privileged content is dynamically drawn from on-policy generations rather than static references.
- Step-Level Divergence Supervision: Token-level supervision is ill-posed for dLLMs, which jointly reveal multiple tokens per denoising step; some retained, others remasked. d-OPSD introduces step-level KL divergence computed only over the subset of tokens actively updated (top-k by confidence in each step). This supervision more closely mirrors the Markovian state transitions of the diffusion process, aligning objectives with the intrinsic operational semantics of dLLMs.
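The two ideas above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `MASK_ID`, `build_teacher_input`, `step_level_reverse_kl`, and the `reveal_ratio` and `answer_start` parameters are hypothetical names, the logits are random stand-ins for real model outputs, and supervised positions are picked by teacher confidence among positions the teacher still sees as masked.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask token id, chosen for illustration only


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def build_teacher_input(student_state, rollout, answer_start, reveal_ratio, rng):
    """Copy the student's partially masked state and reveal a random share of
    the model's own finished rollout inside the suffix (answer) region."""
    teacher_state = student_state.copy()
    suffix = np.arange(answer_start, len(rollout))
    n_reveal = int(round(reveal_ratio * len(suffix)))
    revealed = rng.choice(suffix, size=n_reveal, replace=False)
    teacher_state[revealed] = rollout[revealed]
    return teacher_state


def step_level_reverse_kl(student_logits, teacher_logits, still_masked, k):
    """Reverse KL(student || teacher), averaged over the k masked positions
    where the teacher is most confident. Returns (loss, chosen_positions)."""
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    # Teacher confidence = max token probability; unmasked positions are excluded.
    confidence = np.where(still_masked, np.exp(t_logp).max(axis=-1), -np.inf)
    k = min(k, int(still_masked.sum()))
    if k == 0:
        return 0.0, np.array([], dtype=int)
    chosen = np.argsort(-confidence)[:k]
    per_position = (np.exp(s_logp) * (s_logp - t_logp)).sum(axis=-1)
    return float(per_position[chosen].mean()), chosen


rng = np.random.default_rng(0)
rollout = np.array([5, 9, 2, 7, 3, 8, 1, 4])        # the model's own sampled answer
student_state = np.array([5, 9] + [MASK_ID] * 6)    # current denoising state
teacher_state = build_teacher_input(student_state, rollout, answer_start=4,
                                    reveal_ratio=0.5, rng=rng)

# Random stand-ins for the two forward passes (8 positions, vocabulary of 10).
student_logits = rng.normal(size=(8, 10))
teacher_logits = rng.normal(size=(8, 10))
loss, chosen = step_level_reverse_kl(student_logits, teacher_logits,
                                     teacher_state == MASK_ID, k=2)
print("teacher state:", teacher_state)
print("supervised positions:", chosen, "loss:", loss)
```

In a real pipeline the student and teacher logits would come from the same network evaluated on `student_state` and `teacher_state`, with gradients flowing only through the student pass; that detail is an assumption here rather than something the summary spells out.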
Distinctiveness from Off-Policy and AR-Style Distillation
While prior pseudo-trajectory or reference-reveal distillation methods for dLLMs have been proposed, d-OPSD distinguishes itself by being strictly on-policy, deriving privileged information from the current model's own sampled trajectory rather than dataset labels or static solutions. This results in higher knowledge transfer and less exposure bias, a challenge that plagues traditional SFT and off-policy methods.
Experimental Analysis
d-OPSD is evaluated on the LLaDA-8B-Instruct dLLM across mathematical reasoning (GSM8K, MATH500) and planning (Sudoku 4×4, Countdown) tasks. Notable findings:
- Superior Accuracy: d-OPSD matches or surpasses SFT and RLVR baselines on all tasks and sequence lengths tested. For instance, on GSM8K with a length of 256, d-OPSD attains 81.0% accuracy versus 78.8% (SFT variant) and 79.8% (RLVR/diffu-GRPO).
- Sample Efficiency: d-OPSD converges to peak performance in approximately 10% of the optimization steps required by RLVR baselines, highlighting substantial gains in training efficiency.
Ablation and Diagnostic Studies
- Self-Teacher Strength: Toy experiments confirm the sufficiency of future-self privileged information, as the self-teacher can "resurrect" correct answers even at modest reveal ratios.
- Self-Teacher Construction: The AR-style OPSD adaptation for dLLMs, relying on prefix reference appending, underperforms the d-OPSD scheme. Overlap analysis of top-k predictions reveals that AR-style privileged information yields minimal diversity, while the suffix-future approach transfers more novel knowledge patterns.
- Divergence Variant: Reverse KL outperforms forward KL, consistent with prior theory emphasizing strong mode-seeking behavior for distillation stability.
- Retain Ratio and Subset Selection: Performance is robust to the ratio of revealed tokens in the teacher; however, selecting the top-k positions from the teacher's own step-wise confidences provides stronger supervision than from the student.
- Pass@k Sampling: Even at k=1 (no trajectory rejection), d-OPSD outperforms RLVR, with further gains when pass@k is enabled during on-policy sampling.
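The mode-seeking behavior mentioned in the divergence ablation can be seen on a toy example. The sketch below is illustrative only: the three distributions are made up, and `kl` is a hypothetical helper. It compares a student that commits to one teacher mode with a student that spreads mass evenly, under both divergence directions.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


teacher = np.array([0.49, 0.01, 0.01, 0.49])   # two sharp modes
one_mode = np.array([0.97, 0.01, 0.01, 0.01])  # student commits to one mode
spread = np.array([0.25, 0.25, 0.25, 0.25])    # student covers every token

for name, student in [("one_mode", one_mode), ("spread", spread)]:
    forward = kl(teacher, student)  # KL(teacher || student), mass-covering
    reverse = kl(student, teacher)  # KL(student || teacher), mode-seeking
    print(f"{name}: forward={forward:.3f} reverse={reverse:.3f}")
```

Reverse KL assigns the lower loss to the student that commits to a single mode, whereas forward KL prefers the spread-out student. This matches the usual intuition for why reverse KL yields sharper students, though it does not by itself explain the stability gains reported in the ablation.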
Failure Modes
Despite its advantages, d-OPSD, like RLVR approaches, is susceptible to policy collapse post-convergence. This is hypothesized to result from excessively narrow policy focus due to the mode-seeking divergence, limiting further learning progress. Robustification against collapse remains an outstanding challenge.
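One simple way to watch for this kind of narrowing, offered here as an assumption rather than a diagnostic from the paper, is to track the mean predictive entropy of the policy over masked positions during training. The helper name `mean_token_entropy` is hypothetical.

```python
import numpy as np


def mean_token_entropy(logits):
    """Mean Shannon entropy (in nats) of the per-position softmax distributions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-(np.exp(logp) * logp).sum(axis=-1).mean())


# Illustrative logits for 3 positions over a vocabulary of 4 tokens.
diffuse = np.zeros((3, 4))                        # uniform predictions
peaked = np.tile([20.0, 0.0, 0.0, 0.0], (3, 1))   # nearly deterministic predictions
print(mean_token_entropy(diffuse), mean_token_entropy(peaked))
```

A value that keeps falling toward zero after accuracy has plateaued would be consistent with the narrow policy focus described above, and could serve as a trigger for early stopping.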
Theoretical and Practical Implications
The introduction of d-OPSD carries two implications for dLLM development:
- Alignment of Training Objective and Generation Mechanism: By matching training supervision to the dLLMs' step-level denoising process and leveraging bidirectional conditioning, d-OPSD aligns optimization with generative structure, avoiding the suboptimality that arises from AR-centric distillation.
- Scalable and Efficient Post-Training: Dense, high-frequency self-teacher supervision dramatically improves training sample efficiency and scalability for post-training dLLMs, facilitating broader adoption and practical deployment for reasoning and planning applications.
Practically, d-OPSD can be incorporated in existing dLLM pipelines with minimal architectural changes, only requiring adaptation of the teacher-student supervision pipeline and loss objective. The approach is agnostic to model size and task domain, subject to the presumption of a suitable diffusion LLM backbone.
Future Directions
Potential avenues for further development include:
- Stabilization Techniques: Integration of momentum-anchored or entropy-regularized policy strategies to mitigate collapse.
- Extension to Multi-agent and Multi-task Learning: Leveraging multi-future self-generated privileged information or joint training across tasks/domains.
- Analysis of Knowledge Transfer Dynamics: Further dissection of "thinking pattern" transfer and convergence properties in the absence of static references.
- Scaling Experiments and Application Expansion: Assessing transferability to giga-scale dLLMs and practical impact on downstream tasks in coding, dialogue, and scientific reasoning.
Conclusion
d-OPSD presents a principled and efficient on-policy self-distillation methodology for dLLMs, reconciling privileged information construction and supervision granularity with the fundamental properties of diffusion-based LLMs. The resultant improvements in reasoning accuracy and training sample efficiency mark a significant evolution in dLLM post-training protocols and lay the groundwork for further research in self-improving diffusion language systems (2606.18195).