Learning from the Self-future: On-policy Self-distillation for dLLMs

Published 16 Jun 2026 in cs.CL | (2606.18195v1)

Abstract: On-policy self-distillation (OPSD) has proven effective for post-training LLMs, yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.


Authors (7)

Summary

The paper presents d-OPSD, an on-policy self-distillation method that leverages suffix conditioning to match the bidirectional nature of diffusion LLMs.
It introduces step-level KL divergence supervision over top-k token updates, aligning training objectives with the inherent diffusion process.
Experimental results demonstrate that d-OPSD outperforms SFT and RLVR in reasoning tasks, achieving higher accuracy with significantly fewer optimization steps.

On-Policy Self-Distillation for Diffusion LLMs: The d-OPSD Framework

Introduction

Diffusion-based LLMs (dLLMs) are an alternative to traditional autoregressive (AR) LLMs that generate sequences by iterative denoising. Unlike AR models, which impose strict left-to-right conditional dependencies, dLLMs support arbitrary-order generation and can speed up inference through block-wise and parallel decoding. Recent work has improved dLLM reasoning through reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT). On-policy distillation, which is highly effective for AR models, has remained largely unexplored for dLLMs. This work addresses that gap with a tailored on-policy self-distillation (OPSD) scheme, called d-OPSD, designed around the generation dynamics and bidirectional conditioning capabilities of dLLMs (2606.18195).

Methodological Innovations

Revisiting On-Policy Distillation for dLLMs

OPSD in AR models typically injects privileged information through a left-to-right prefix and supervises the student with a dense token-level KL divergence. This design does not carry over to dLLMs, which follow a non-sequential denoising trajectory and can condition on a suffix as well as a prefix. The d-OPSD framework introduces two complementary innovations tailored for dLLMs:

Self-Teacher Construction via Suffix Conditioning: dLLMs allow exposed tokens anywhere in the sequence and can condition on both prefix and suffix contexts. In d-OPSD, the model’s own generated answer is revealed as a suffix and serves as privileged information for the self-teacher. This uses the model’s capacity for $p(\text{prefix} \mid \text{suffix})$ conditioning and exploits on-policy rollouts without data augmentation. Notably, the privileged content is drawn dynamically from on-policy generations rather than from static references.
Step-Level Divergence Supervision: Token-level supervision is ill-posed for dLLMs, which predict multiple tokens jointly at each denoising step, retaining some and remasking others. d-OPSD instead computes a step-level KL divergence only over the subset of tokens actively updated in that step (the top-$k$ by confidence). This supervision more closely mirrors the Markovian state transitions of the diffusion process, so the training objective matches how dLLMs actually generate.
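The step-level supervision described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the dictionary layout of the logits, and the choice to average over the selected positions are all assumptions made for illustration. Ranking positions by the teacher's confidence and using the reverse KL direction follow the ablation findings reported later in this summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def step_level_kl(student_logits, teacher_logits, masked_positions, k):
    """Hypothetical step-level loss for a single denoising step.

    student_logits / teacher_logits: dict mapping position -> list of logits.
    masked_positions: positions still masked at this step.
    Returns (loss, selected_positions).
    """
    if not masked_positions or k <= 0:
        raise ValueError("need at least one masked position and k > 0")
    student = {i: softmax(student_logits[i]) for i in masked_positions}
    teacher = {i: softmax(teacher_logits[i]) for i in masked_positions}
    # Assumption: confidence is the probability of the most likely token,
    # and positions are ranked by the teacher's confidence.
    ranked = sorted(masked_positions, key=lambda i: max(teacher[i]), reverse=True)
    selected = ranked[:k]
    # Reverse KL, KL(student || teacher), averaged over the selected positions
    # (averaging rather than summing is an assumption of this sketch).
    loss = sum(kl_divergence(student[i], teacher[i]) for i in selected) / len(selected)
    return loss, selected

# Toy example: vocabulary of 3 tokens, positions 1, 2 and 4 still masked.
teacher_logits = {1: [4.0, 0.0, 0.0], 2: [0.1, 0.0, 0.0], 4: [0.0, 3.0, 0.0]}
student_logits = {1: [1.0, 0.0, 0.0], 2: [0.0, 0.0, 0.0], 4: [0.0, 3.0, 0.0]}
loss, selected = step_level_kl(student_logits, teacher_logits, [1, 2, 4], k=2)
print(selected, round(loss, 4))
```

In this toy case only the two positions where the teacher is most confident contribute to the loss. Position 2, where the teacher is close to uniform, is ignored for this step, which is the intended contrast with dense token-level supervision.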

Distinctiveness from Off-Policy and AR-Style Distillation

Prior distillation methods for dLLMs rely on pseudo-trajectories or on revealing reference solutions. d-OPSD differs by being strictly on-policy: it derives privileged information from the current model’s own sampled trajectory rather than from dataset labels or static solutions. This is intended to improve knowledge transfer and reduce exposure bias, a problem that affects traditional SFT and off-policy methods.
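To make the on-policy construction concrete, the following sketch builds a student input and a self-teacher input from one rollout. It is a simplified illustration under stated assumptions, not the paper's procedure: the `MASK` string, the `build_inputs` helper, the `answer_len` parameter, and the idea that the answer occupies the trailing tokens of the rollout are all hypothetical.

```python
MASK = "<mask>"  # hypothetical mask token, for illustration only

def build_inputs(prompt, rollout, state_mask, answer_len):
    """Build (student_input, teacher_input) for one denoising state.

    prompt:      list of prompt tokens.
    rollout:     response tokens sampled from the current model (on-policy).
    state_mask:  list of bools, True where the response is still masked.
    answer_len:  number of trailing rollout tokens treated as the answer.
    """
    if len(rollout) != len(state_mask):
        raise ValueError("rollout and state_mask must have the same length")
    if not 0 <= answer_len <= len(rollout):
        raise ValueError("answer_len out of range")
    # Student: sees only the partially denoised state.
    student_resp = [MASK if masked else tok for tok, masked in zip(rollout, state_mask)]
    # Teacher: same state, plus the model's own answer revealed as a suffix.
    teacher_resp = list(student_resp)
    start = len(rollout) - answer_len
    teacher_resp[start:] = rollout[start:]
    return prompt + student_resp, prompt + teacher_resp

prompt = ["Q", ":"]
rollout = ["2", "+", "2", "=", "4"]            # sampled by the current model
state_mask = [False, True, True, True, True]   # only the first token is unmasked
student_input, teacher_input = build_inputs(prompt, rollout, state_mask, answer_len=1)
print(student_input)
print(teacher_input)
```

Both inputs go through the same network; the only difference is that the teacher pass can condition on the revealed suffix. No dataset label or reference solution appears anywhere in the construction, which is what separates this setup from reference-reveal distillation.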

Experimental Analysis

Reasoning Performance

d-OPSD is evaluated on the LLaDA-8B-Instruct dLLM across mathematical reasoning (GSM8K, MATH500) and planning (Sudoku 4×4, Countdown) tasks. Notable findings:

Superior Accuracy: d-OPSD matches or surpasses the SFT and RLVR baselines on all tasks and sequence lengths tested. For instance, on GSM8K with a sequence length of 256, d-OPSD attains 81.0% accuracy versus 78.8% (SFT variant) and 79.8% (RLVR/diffu-GRPO).
Sample Efficiency: d-OPSD reaches peak performance in approximately 10% of the optimization steps required by the RLVR baselines, a substantial gain in training efficiency.

Ablation and Diagnostic Studies

Self-Teacher Strength: Toy experiments confirm the sufficiency of future-self privileged information, as the self-teacher can “resurrect” correct answers even at modest reveal ratios.
Self-Teacher Construction: The AR-style OPSD adaptation for dLLMs, relying on prefix reference appending, underperforms the d-OPSD scheme. Overlap analysis of top-$K$ predictions reveals that AR-style privileged information yields minimal diversity, while the suffix-future approach transfers more novel knowledge patterns.
Divergence Variant: Reverse KL outperforms forward KL, consistent with prior theory emphasizing strong mode-seeking behavior for distillation stability.
Retain Ratio and Subset Selection: Performance is robust to the ratio of revealed tokens in the teacher; however, selecting the top-$k$ positions from the teacher’s own step-wise confidences provides stronger supervision than selecting them from the student’s.
Pass@$k$ Sampling: Even at $k=1$ (no trajectory rejection), d-OPSD outperforms RLVR, with further gains when pass@$k$ is enabled during on-policy sampling.
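The difference between the two divergence directions in the ablation above can be seen in a toy calculation. The distributions below are invented for illustration and are not taken from the paper. The teacher has two modes; one candidate student commits to a single mode and the other spreads its mass evenly.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a 3-token vocabulary.
teacher = [0.495, 0.495, 0.01]   # two modes, one unlikely token
focused = [0.98, 0.01, 0.01]     # student that commits to one teacher mode
spread  = [0.34, 0.33, 0.33]     # student that covers everything evenly

# Forward KL: KL(teacher || student), penalizes missing a teacher mode.
forward_focused = kl_divergence(teacher, focused)
forward_spread = kl_divergence(teacher, spread)

# Reverse KL: KL(student || teacher), penalizes mass on unlikely tokens.
reverse_focused = kl_divergence(focused, teacher)
reverse_spread = kl_divergence(spread, teacher)

print("forward prefers:", "focused" if forward_focused < forward_spread else "spread")
print("reverse prefers:", "focused" if reverse_focused < reverse_spread else "spread")
```

Forward KL favors the spread student, while reverse KL favors the focused one. This is the mode-seeking behavior referred to above.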

Failure Modes

Despite its advantages, d-OPSD, like RLVR approaches, is susceptible to policy collapse after convergence. This is hypothesized to result from an excessively narrow policy induced by the mode-seeking divergence, which limits further learning progress. Making training robust to collapse remains an open challenge.

Theoretical and Practical Implications

The introduction of d-OPSD has two main implications for dLLM development:

Alignment of Training Objective and Generation Mechanism: By matching training supervision to the dLLMs’ step-level denoising process and leveraging bidirectional conditioning, d-OPSD aligns optimization with generative structure, avoiding the suboptimality that arises from AR-centric distillation.
Scalable and Efficient Post-Training: Dense self-teacher supervision at every denoising step improves training sample efficiency for post-training dLLMs (roughly 10% of the optimization steps of RLVR in the reported experiments), which makes post-training for reasoning and planning applications more practical.

Practically, d-OPSD can be incorporated into existing dLLM pipelines with minimal architectural changes: only the teacher-student supervision pipeline and the loss objective need to be adapted. The approach is presented as agnostic to model size and task domain, provided a suitable diffusion LLM backbone is available.

Future Directions

Potential avenues for further development include:

Stabilization Techniques: Integration of momentum-anchored or entropy-regularized policy strategies to mitigate collapse.
Extension to Multi-agent and Multi-task Learning: Leveraging multi-future self-generated privileged information or joint training across tasks/domains.
Analysis of Knowledge Transfer Dynamics: Further dissection of “thinking pattern” transfer and convergence properties in the absence of static references.
Scaling Experiments and Application Expansion: Assessing transferability to larger-scale dLLMs and practical impact on downstream tasks in coding, dialogue, and scientific reasoning.

Conclusion

d-OPSD presents a principled and efficient on-policy self-distillation methodology for dLLMs, reconciling privileged information construction and supervision granularity with the fundamental properties of diffusion-based LLMs. The resultant improvements in reasoning accuracy and training sample efficiency mark a significant evolution in dLLM post-training protocols and lay the groundwork for further research in self-improving diffusion language systems (2606.18195).
