
Learning from the Self-future: On-policy Self-distillation for dLLMs

Published 16 Jun 2026 in cs.CL | (2606.18195v1)

Abstract: On-policy self-distillation (OPSD) has proven effective for post-training LLMs, yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.

Summary

  • The paper presents d-OPSD, an on-policy self-distillation method that leverages suffix conditioning to match the bidirectional nature of diffusion LLMs.
  • It introduces step-level KL divergence supervision over top-k token updates, aligning training objectives with the inherent diffusion process.
  • Experimental results demonstrate that d-OPSD outperforms SFT and RLVR in reasoning tasks, achieving higher accuracy with significantly fewer optimization steps.

On-Policy Self-Distillation for Diffusion LLMs: The d-OPSD Framework

Introduction

Diffusion-based LLMs (dLLMs) present a compelling alternative to traditional autoregressive (AR) LLMs by generating sequences through iterative denoising. Unlike AR models, which impose strict left-to-right conditional dependencies, dLLMs support arbitrary-order generation and can speed up inference through block-wise and parallel decoding. Recent studies have sought to improve dLLM reasoning via reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT); however, on-policy distillation techniques, which are highly effective in AR models, have remained largely unexplored for dLLMs. This work addresses that gap by introducing a tailored on-policy self-distillation (OPSD) scheme, called d-OPSD, aligned with the generation dynamics and bidirectional conditioning capabilities of dLLMs (2606.18195).

Methodological Innovations

Revisiting On-Policy Distillation for dLLMs

OPSD in AR models typically supplies privileged information as a left-side prefix and applies dense token-level KL divergence supervision. This design fits dLLMs poorly: their denoising trajectory is non-sequential, and they can condition on a suffix as well as a prefix. The d-OPSD framework introduces two orthogonal innovations tailored for dLLMs:

  1. Self-Teacher Construction via Suffix Conditioning: dLLMs allow revealed tokens anywhere in the sequence and can condition on both prefix and suffix contexts. In d-OPSD, the model's own self-generated answer is revealed as a suffix, serving as privileged information for the self-teacher. This leverages the model's capacity for $p(\text{prefix} \mid \text{suffix})$ conditioning and fully exploits on-policy rollouts without data augmentation. Notably, the privileged content is drawn dynamically from on-policy generations rather than from static references.
  2. Step-Level Divergence Supervision: Token-level supervision is ill-posed for dLLMs, which predict multiple tokens jointly at each denoising step, retaining some and remasking others. d-OPSD introduces a step-level KL divergence computed only over the subset of tokens actively updated in each step (the top-$k$ by confidence). This supervision more closely mirrors the Markovian state transitions of the diffusion process, aligning the training objective with how dLLMs actually generate.
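
The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MASK` token, the function names, the choice to treat the last `answer_len` rollout tokens as the revealed answer, and the use of per-position maximum probability as "confidence" are all hypothetical. The reverse KL direction and the teacher-side top-$k$ selection follow the ablation findings reported later in this summary.

```python
import math

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def build_inputs(prompt, rollout, answer_len):
    """Build (student_input, teacher_input) for a fully masked response.

    Assumption: the final `answer_len` tokens of the on-policy rollout are
    the "answer" that the self-teacher sees as a revealed suffix.
    """
    n = len(rollout)
    student = prompt + [MASK] * n
    teacher = prompt + [MASK] * (n - answer_len) + rollout[n - answer_len:]
    return student, teacher


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one position."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
    )


def step_level_loss(student_logits, teacher_logits, masked_positions, k):
    """Average reverse KL over the k masked positions where the teacher
    is most confident (confidence = max probability at that position)."""
    t_probs = {i: softmax(teacher_logits[i]) for i in masked_positions}
    ranked = sorted(masked_positions, key=lambda i: max(t_probs[i]), reverse=True)
    chosen = ranked[:k]
    if not chosen:
        return 0.0, []
    total = sum(reverse_kl(softmax(student_logits[i]), t_probs[i]) for i in chosen)
    return total / len(chosen), chosen
```

In a real training loop the two logit tables would come from the same network run on the student input and the suffix-revealed teacher input respectively, with gradients flowing only through the student pass; that detail is an assumption here rather than something this summary specifies.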

Distinctiveness from Off-Policy and AR-Style Distillation

While pseudo-trajectory and reference-reveal distillation methods for dLLMs have been proposed previously, d-OPSD distinguishes itself by being strictly on-policy: privileged information comes from the current model's own sampled trajectory rather than from dataset labels or static solutions. According to the paper, this yields more effective knowledge transfer and reduces exposure bias, a problem that affects traditional SFT and off-policy methods.

Experimental Analysis

Reasoning Performance

d-OPSD is evaluated on the LLaDA-8B-Instruct dLLM across mathematical reasoning (GSM8K, MATH500) and planning (Sudoku 4×4, Countdown) tasks. Notable findings:

  • Superior Accuracy: d-OPSD matches or surpasses the SFT and RLVR baselines on all tasks and sequence lengths tested. For instance, on GSM8K with a generation length of 256, d-OPSD attains 81.0% accuracy versus 78.8% (SFT variant) and 79.8% (RLVR/diffu-GRPO).
  • Sample Efficiency: d-OPSD reaches peak performance in approximately 10% of the optimization steps required by the RLVR baselines, a substantial gain in training efficiency.

Ablation and Diagnostic Studies

  • Self-Teacher Strength: Toy experiments indicate that future-self privileged information is sufficient, as the self-teacher can "resurrect" correct answers even at modest reveal ratios.
  • Self-Teacher Construction: An AR-style OPSD adaptation for dLLMs, which appends a reference as a prefix, underperforms the d-OPSD scheme. Overlap analysis of top-$K$ predictions suggests that AR-style privileged information changes the teacher's predictions very little, while the suffix-future approach transfers more novel knowledge patterns.
  • Divergence Variant: Reverse KL outperforms forward KL, consistent with prior theory emphasizing strong mode-seeking behavior for distillation stability.
  • Retain Ratio and Subset Selection: Performance is robust to the ratio of revealed tokens in the teacher; however, selecting the top-$k$ positions from the teacher's own step-wise confidences provides stronger supervision than selecting them from the student's.
  • Pass@$k$ Sampling: Even at $k=1$ (no trajectory rejection), d-OPSD outperforms RLVR, with further gains when pass@$k$ is enabled during on-policy sampling.
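
The difference between the two KL directions in the Divergence Variant ablation can be seen in a small numeric sketch. The distributions below are made up for illustration and do not come from the paper; the point is only the standard property that forward KL, KL(teacher || student), favors a student that covers all teacher modes, while reverse KL, KL(student || teacher), favors a student that commits to one mode.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical bimodal teacher over three tokens.
teacher = [0.495, 0.495, 0.01]

# Two hypothetical students: one commits to a single mode, one spreads out.
students = {
    "peaked": [0.98, 0.01, 0.01],
    "spread": [0.34, 0.33, 0.33],
}

forward = {name: kl(teacher, s) for name, s in students.items()}  # KL(teacher || student)
reverse = {name: kl(s, teacher) for name, s in students.items()}  # KL(student || teacher)

for name in students:
    print(f"{name}: forward={forward[name]:.3f} reverse={reverse[name]:.3f}")
```

Under forward KL the spread student scores better; under reverse KL the peaked student does. This mode-seeking tendency is what the ablation credits for distillation stability.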

Failure Modes

Despite its advantages, d-OPSD, like RLVR approaches, is susceptible to policy collapse after convergence. This is hypothesized to result from an excessively narrow policy induced by the mode-seeking divergence, which limits further learning progress. Making training robust to collapse remains an open challenge.

Theoretical and Practical Implications

The introduction of d-OPSD carries two main implications for dLLM development:

  • Alignment of Training Objective and Generation Mechanism: By matching training supervision to the dLLM's step-level denoising process and leveraging bidirectional conditioning, d-OPSD aligns optimization with generative structure, avoiding the suboptimality that arises from AR-centric distillation.
  • Scalable and Efficient Post-Training: Dense self-teacher supervision at every denoising step improves sample efficiency for post-training dLLMs, which may ease adoption and practical deployment in reasoning and planning applications.

Practically, d-OPSD can be incorporated into existing dLLM pipelines with minimal architectural changes; only the teacher-student supervision pipeline and the loss objective need to be adapted. In principle the approach is agnostic to model size and task domain, provided a suitable diffusion LLM backbone is available, although the reported experiments use a single 8B model.

Future Directions

Potential avenues for further development include:

  • Stabilization Techniques: Integration of momentum-anchored or entropy-regularized policy strategies to mitigate collapse.
  • Extension to Multi-agent and Multi-task Learning: Leveraging multi-future self-generated privileged information or joint training across tasks/domains.
  • Analysis of Knowledge Transfer Dynamics: Further dissection of "thinking pattern" transfer and convergence properties in the absence of static references.
  • Scaling Experiments and Application Expansion: Assessing transferability to larger-scale dLLMs and the practical impact on downstream tasks in coding, dialogue, and scientific reasoning.

Conclusion

d-OPSD presents a principled and efficient on-policy self-distillation method for dLLMs, reconciling privileged-information construction and supervision granularity with the fundamental properties of diffusion-based LLMs. The resulting improvements in reasoning accuracy and sample efficiency advance dLLM post-training and lay the groundwork for further research on self-improving diffusion language systems (2606.18195).
