On-Policy Distillation Explained

Updated 3 July 2026

On-policy distillation is a training paradigm where the student model is refined using dense, token-level supervision on its own trajectories, addressing exposure bias.
It minimizes divergence between student and teacher distributions during on-policy rollouts, thereby reducing compounding errors and enhancing model alignment and capacity transfer.
Dual On-Policy Distillation (DOPD) further improves performance by dynamically routing token-level feedback based on advantage gaps and confidence, adapting to both language and vision-language tasks.

On-policy distillation (OPD) is a post-training paradigm in which a student model is optimized using dense, token-level supervision from a stronger policy (an external teacher, a self-teacher, or a hybrid) applied to trajectories sampled from the student's own policy. This approach differs fundamentally from off-policy knowledge distillation: the student receives feedback on states it actually visits, rather than on fixed teacher-generated traces. OPD encompasses a spectrum of objectives, teacher access modes, and feedback signals, and has driven rapid advances in model alignment, capacity transfer, and efficient reasoning across large-scale language and vision-language models.

1. Foundations and Core Principles

At its core, OPD defines an autoregressive student policy $\pi_\theta(y|x) = \prod_{t=1}^T \pi_\theta(y_t|x,y_{<t})$ and minimizes a divergence, commonly the reverse KL, between the student's trajectory distribution and a teacher policy $\pi^*$ (which may receive privileged context):

$\mathcal{L}_{\rm OPD}(\theta) = \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot|x)} \left[\sum_{t=1}^T D_{\rm KL}\big(\pi_\theta(\cdot|x,y_{<t}) \, \| \, \pi^*(\cdot|x,y_{<t})\big)\right].$
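As a minimal sketch of this objective, the snippet below evaluates the summed per-token reverse KL along a single student-sampled rollout. The function names and the toy three-token distributions are assumptions made for illustration; in practice the distributions would come from the student and teacher models, and the expectation would be estimated over many sampled rollouts.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    total = 0.0
    for p_s, p_t in zip(student_probs, teacher_probs):
        if p_s > 0.0:  # terms with zero student mass contribute nothing
            total += p_s * (math.log(p_s) - math.log(max(p_t, eps)))
    return total

def opd_sequence_loss(student_dists, teacher_dists):
    """Sum of per-step reverse KLs along one student-sampled rollout."""
    return sum(reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists))

# Toy rollout: two steps over a three-token vocabulary (illustrative numbers).
student_dists = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher_dists = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]
loss = opd_sequence_loss(student_dists, teacher_dists)
```

Only the first step contributes here, since the student and teacher agree exactly at the second step.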

By sampling rollouts on-policy from $\pi_\theta$, OPD addresses exposure bias, the failure mode in which off-policy students never learn to recover from their own mistakes. Because the student is iteratively updated on the distribution it generates itself, the theoretical compounding error rate drops from $O(\epsilon T^2)$ (off-policy) to $O(\epsilon T)$, where $\epsilon$ is the per-step error and $T$ the sequence length, consistent with DAgger-style analysis (Song et al., 1 Apr 2026).
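To make the scaling difference concrete, the sketch below compares the two growth rates for an assumed per-step error and horizon. The constants hidden by the $O(\cdot)$ notation are taken to be 1 here, which is purely an illustrative assumption; only the ratio between the two rates is meaningful.

```python
def compounding_error_bounds(eps, horizon):
    """Return (off-policy, on-policy) error growth with unit constants assumed."""
    off_policy = eps * horizon ** 2  # quadratic in the horizon
    on_policy = eps * horizon        # linear in the horizon
    return off_policy, on_policy

# Illustrative values: 1% per-step error over a 100-token sequence.
off, on = compounding_error_bounds(eps=0.01, horizon=100)
ratio = off / on  # the gap between the two rates grows linearly with the horizon
```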

Dense supervision is typically at the token or chunk level, with the teacher providing probabilities, preferences, or structured feedback on each student-induced state. This process generalizes to self-distillation, context distillation, and various reward-regularized extensions.

2. Privileged Information and the "Privilege Illusion"

To push performance further, recent practice injects privileged information into either the teacher or student policy during training; examples include chain-of-thought, verified hints, tool descriptions, or bounding box annotations (Yu et al., 29 Jun 2026). However, naive application of OPD under contextual privilege leads to the "privilege illusion": students appear to close the teacher–student gap during training by mimicking privileged cues unavailable at test time, rather than acquiring transferable capability.

The privilege illusion arises due to two factors:

Information asymmetry: Tokens predicted with high confidence by a privileged teacher may encode shortcuts or context-specific artifacts not reproducible in deployment.
Non-uniform token-level signal: Only a minority of tokens along a trajectory are truly capability-bearing; indiscriminate distillation over all tokens risks overfitting privilege-correlated behaviors and collapsing policy diversity.

Empirical analyses reveal entropy collapse, reduced exploration, and degraded generalization when all tokens are treated equally under dense privileged supervision (Yu et al., 29 Jun 2026).

3. Dual On-Policy Distillation: DOPD Paradigm

DOPD (Dual On-Policy Distillation) introduces an advantage-aware, token-wise distillation mechanism that routes supervision between a privileged teacher and a privileged student branch based on their relative advantage and token-wise confidences (Yu et al., 29 Jun 2026).

For each student-sampled trajectory and token $n$, DOPD computes:

$\mathcal{A}_n = |\log q_T - \log q_S| = \Big|\ln\frac{q_T}{q_S}\Big|$

where $q_T$ and $q_S$ are the privileged teacher and privileged student probabilities for $y_n$ given the contextual inputs (including privilege).
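The advantage gap can be sketched in a few lines. The helper name `advantage_gap`, the clamping constant `eps`, and the example probabilities are assumptions added for illustration.

```python
import math

def advantage_gap(q_teacher, q_student, eps=1e-12):
    """Absolute log-probability gap |log q_T - log q_S| for one sampled token."""
    return abs(math.log(max(q_teacher, eps)) - math.log(max(q_student, eps)))

# (q_T, q_S) pairs for three sampled tokens (illustrative numbers).
pairs = [(0.9, 0.85), (0.8, 0.1), (0.05, 0.6)]
gaps = [advantage_gap(q_t, q_s) for q_t, q_s in pairs]
```

Because the gap is an absolute value, it is large whenever the two branches disagree strongly, regardless of which branch is more confident; the direction of the disagreement is handled separately by the regime assignment.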

DOPD assigns each token to one of four regimes, using empirical batch means as thresholds:

Low-Gap, High-Conf (LH): Teacher and student both confident with small gap. Weak teacher distillation via light top-$k$ reverse KL.
Low-Gap, Low-Conf (LL): Both uncertain. Self-regularization toward privileged student output.
High-Gap, Teacher-Favored (HT): Large gap, high teacher confidence. Strong, full-vocabulary JS divergence distillation.
High-Gap, Student-Favored (HS): Large gap, but student more confident. Light self-regularization.

This regime-routing ensures that genuine capability-bearing tokens (high advantage gap, high teacher confidence) receive maximal supervision, while privilege-dominant or noisy tokens receive lighter or self-distillation signals.
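A hedged sketch of the routing step follows. The exact threshold definitions are not recoverable from the text above, so this example assumes the gap threshold is the batch mean of the advantage gap and the confidence threshold is the batch mean of all teacher and student probabilities. The function name `route_tokens` and the tie-breaking rules are likewise assumptions, not the authors' implementation.

```python
import math

def route_tokens(q_teacher, q_student):
    """Label each token LH, LL, HT, or HS using assumed batch-mean thresholds."""
    gaps = [abs(math.log(t) - math.log(s)) for t, s in zip(q_teacher, q_student)]
    gap_thr = sum(gaps) / len(gaps)  # assumed: batch mean of the advantage gap
    # assumed: batch mean confidence over both branches
    conf_thr = (sum(q_teacher) + sum(q_student)) / (2 * len(gaps))
    labels = []
    for gap, t, s in zip(gaps, q_teacher, q_student):
        if gap <= gap_thr:
            # Low gap: LH only if both branches are confident, otherwise LL.
            labels.append("LH" if min(t, s) >= conf_thr else "LL")
        else:
            # High gap: the more confident branch decides the regime.
            labels.append("HT" if t > s else "HS")
    return labels

# Four tokens with strictly positive probabilities (illustrative numbers).
q_teacher = [0.9, 0.2, 0.8, 0.05]
q_student = [0.85, 0.25, 0.1, 0.6]
labels = route_tokens(q_teacher, q_student)
```

In this toy batch the four tokens land in four different regimes, which makes it easy to check each branch of the rule by hand.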

The final per-token DOPD objective combines the regime-specific loss terms, each scaled by a regime weight, with the global loss averaged over on-policy rollouts.
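The regime descriptions name two divergences: a light top-k reverse KL and a full-vocabulary JS divergence. The sketch below shows one way to compute each for a single token position. Selecting the top-k tokens by student probability and renormalizing both distributions over that subset is an assumption of this example; the source does not specify how the truncation is done.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary."""
    return sum(pi * (math.log(pi) - math.log(max(qi, eps)))
               for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence over the full vocabulary."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def topk_reverse_kl(student, teacher, k, eps=1e-12):
    """Reverse KL restricted to the student's top-k tokens (assumed truncation)."""
    top = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    s_mass = max(sum(student[i] for i in top), eps)
    t_mass = max(sum(teacher[i] for i in top), eps)
    s = [student[i] / s_mass for i in top]  # renormalized student subset
    t = [teacher[i] / t_mass for i in top]  # renormalized teacher subset
    return kl(s, t)

student = [0.6, 0.3, 0.1]
teacher = [0.2, 0.5, 0.3]
strong_signal = js_divergence(student, teacher)       # full-vocabulary JS
light_signal = topk_reverse_kl(student, teacher, k=2)  # truncated reverse KL
```

The JS divergence is symmetric and bounded above by $\ln 2$, which keeps the strong imitation signal finite even when the teacher and student distributions barely overlap.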

This architecture generalizes to vision-language and other multimodal settings, where, for example, object labels or bounding boxes serve as privileged context during DOPD (Yu et al., 29 Jun 2026).

4. Empirical Performance, Robustness, and Scalability

Extensive experiments validate DOPD across LLM and VLM tasks using multiple teacher–student scale pairs (up to 8B→0.6B) and diverse privileged information:

Setting	Vanilla OPD	DOPD	Gap Recovery
LLM (Qwen3)	43.9	51.4	89.8%
VLM (Qwen3-VL)	52.4	58.4	69.2%

DOPD consistently surpasses standard OPD and recent variants (ExOPD, Uni-OPD, EOPD, Self-Distillation) by large margins (4–8 points absolute), both in matched-scale and cross-scale settings.
DOPD's gains increase as the teacher–student size ratio grows, maintaining robustness and stability where vanilla OPD plateaus or regresses.
DOPD shows superior continual learning (retains nearly all acquired abilities across multi-stage curricula) and improved out-of-distribution generalization; e.g., training on reasoning, testing on coding yields 3–4 point improvements over leading baselines.
Policy entropy curves under DOPD exhibit an initial controlled rise and moderate decay, avoiding the sharp entropy collapse seen in self-distillation, indicating healthy exploration and stable convergence.
Ablations confirm that top-advantage tokens account for 50–80% of DOPD's relative performance; restricting strong teacher imitation to these tokens yields most of the measured gain.
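Policy entropy curves of the kind described above can be tracked with a small helper like the following, which computes the mean per-token Shannon entropy of the student's next-token distributions along a rollout. The function names and example distributions are assumptions for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(rollout_dists):
    """Average token entropy over all positions of a rollout."""
    return sum(token_entropy(d) for d in rollout_dists) / len(rollout_dists)

# A collapsed (near-deterministic) policy versus a more exploratory one.
collapsed = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
exploratory = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
h_collapsed = mean_policy_entropy(collapsed)
h_exploratory = mean_policy_entropy(exploratory)
```

Logging this quantity per training step would make a sharp entropy collapse visible as a rapid drop toward zero.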

5. Methodological Innovations and Theoretical Underpinnings

DOPD's core innovation is token-wise regime routing, justified by the observation that only tokens with high privilege advantage correspond to transferable capabilities. Unlike uniform KL minimization, DOPD's adaptive strategy mitigates privilege illusion by:

Decoupling information-asymmetry from real competence: tokens where both privileged teacher and student agree (low gap) are likely privilege-dominated and do not warrant strong distillation.
Assigning regime-specific divergence: strong teacher imitation (full-vocab JS or KL) for high-gap, teacher-favored tokens; lightweight self-regularization elsewhere.

Empirical evidence indicates that this approach recovers a large share of the teacher–student gap (89.8% in the LLM setting and 69.2% in the VLM setting reported above), and ablation studies demonstrate its necessity for effective, robust distillation (Yu et al., 29 Jun 2026).

DOPD is algorithmically compatible with existing OPD architectures: tokens are dynamically classified per batch, with regime-specific loss terms aggregated in the main optimization loop.
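A minimal sketch of this aggregation step is shown below, assuming each token already carries a regime label and a regime-specific loss value. The weight values are hypothetical placeholders chosen only to show the mechanics; they are not the tuned values reported by the authors.

```python
def aggregate_dopd_loss(regimes, token_losses, weights):
    """Weighted mean of per-token losses, with one weight per regime."""
    total = sum(weights[r] * loss for r, loss in zip(regimes, token_losses))
    return total / len(token_losses)

# Hypothetical regime weights: strong signal for HT, light signal elsewhere.
weights = {"LH": 0.1, "LL": 0.1, "HT": 1.0, "HS": 0.1}
regimes = ["LH", "LL", "HT", "HS"]    # per-token regime labels for one batch
token_losses = [0.2, 0.4, 1.5, 0.3]   # regime-specific loss already computed per token
batch_loss = aggregate_dopd_loss(regimes, token_losses, weights)
```

Keeping the routing and the weighting separate means an existing OPD loop only needs the extra labeling pass before the loss reduction.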

6. Broader Implications and Generalization

The DOPD framework—optimal routing of token-level supervision between privileged teachers and privileged students via advantage gap—offers a blueprint for selective, context-aware knowledge transfer beyond LLMs and VLMs. The same principles can be adapted to continuous-control RL: on-policy agent updates can be informed by the per-state advantage gap between a privileged demonstrator and the current policy, focusing strong supervision on states exhibiting true capability distinctions (Yu et al., 29 Jun 2026).

This reframing aligns with recent perspectives that position OPD as a modular interface for routed feedback—where choices of state source, credit assignment, privilege masking, and routing are dynamically determined per context (Zhang, 22 Jun 2026). A plausible implication is the emergence of a new design space for hybrid OPD methods, integrating regime-aware routing, counterfactual probability reassignment, and privilege compressibility diagnostics.

7. Practical Considerations and Limitations

DOPD's implementation presupposes batch-wise computation of the privilege advantage gap, as well as routine availability of both privileged teacher and student policies during training.
The weights governing regime strength can be tuned; empirical optima are reported in the DOPD paper (Yu et al., 29 Jun 2026).
The forms of privileged information most beneficial for distillation are task-dependent: step-wise hints (without execution traces) in LLMs and bounding boxes with labels in VLMs yield the largest gains, while over-privileging (e.g., final answer access) degrades performance.
DOPD generalizes across model scales, domains, and continual learning scenarios, and demonstrates strong OOD resilience, but it requires further validation at very large scales and in multi-agent or partially observed domains. Residual privilege effects at extreme model gaps and stability in highly dynamic environments warrant further study.

References:

"DOPD: Dual On-Policy Distillation" (Yu et al., 29 Jun 2026)
"A Formula-Driven Survey and Research Agenda for On-Policy Distillation" (Zhang, 22 Jun 2026)
"A Survey of On-Policy Distillation for Large Language Models" (Song et al., 1 Apr 2026)
"Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization" (Yu et al., 6 May 2026)
