On-Policy Distillation (OPD)

Updated 4 July 2026

OPD is a post-training paradigm where a student model samples its own trajectories and receives token-level teacher feedback to mitigate exposure bias.
It aligns the training state distribution with real inference conditions, addressing train-test mismatch and cumulative autoregressive errors.
OPD encompasses diverse methodologies including imitation learning and dense token-level reinforcement learning, with applications in reasoning, code generation, and agentic tasks.

On-policy distillation (OPD) is a post-training paradigm for LLMs in which a student policy is optimized on trajectories sampled from the student itself, while a stronger teacher supplies feedback on those student-generated states. In contrast to standard supervised fine-tuning and classical off-policy distillation on fixed teacher-forced data, OPD aligns the training distribution with inference-time behavior, directly addressing exposure bias and the train–test mismatch induced by autoregressive error compounding. Across the recent literature, OPD is treated both as interactive imitation learning and as a dense token-level reinforcement-learning surrogate, with formulations spanning divergence minimization, reward-guided learning, and self-distillation (Song et al., 1 Apr 2026, Fang et al., 8 May 2026).

1. Definition and formal structure

In the standard formulation, a prompt $x$ is drawn from a data distribution $D$, the student samples a completion $y \sim \pi_\theta(\cdot \mid x)$, and the teacher $\pi_T$ evaluates the prefixes the student actually visits. A representative objective used across papers is

$\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim D}\; \mathbb{E}_{y \sim \pi_\theta(\cdot\mid x)} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \mathcal{D}\!\left( \pi_T(\cdot \mid x, y_{<t}) \,\Big\|\, \pi_\theta(\cdot \mid x, y_{<t}) \right) \right],$

where $\mathcal{D}$ may be a KL divergence or another $f$-divergence (Lazaridis et al., 22 May 2026). In logit-based OPD, the teacher provides full next-token distributions on the student’s own prefixes, and the student is updated to match them token by token (Fang et al., 8 May 2026).
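The objective above can be sketched on a toy problem. The example below is a minimal illustration, not any paper's implementation: the three-token vocabulary, the `toy_policy` helper, and the choice of reverse KL as $\mathcal{D}$ are all assumptions made for the sake of a runnable example. The point it shows is that the teacher is queried on prefixes sampled from the student, not on teacher-generated data.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student, teacher):
    """KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0.0)

def toy_policy(sharpness):
    """Hypothetical policy: next-token logits depend only on the last token."""
    def policy(prefix):
        last = prefix[-1] if prefix else 0
        return softmax([sharpness * ((last + j) % 3) for j in range(3)])
    return policy

def opd_loss(student, teacher, max_len, rng):
    """Sample one rollout from the student and average the divergence over its prefixes."""
    prefix, total = [], 0.0
    for _ in range(max_len):
        p_s, p_t = student(prefix), teacher(prefix)   # teacher scored on the student's state
        total += reverse_kl(p_s, p_t)
        prefix.append(rng.choices(range(3), weights=p_s)[0])  # next token comes from the student
    return total / max_len

rng = random.Random(0)
student, teacher = toy_policy(0.5), toy_policy(1.5)
loss = opd_loss(student, teacher, max_len=8, rng=rng)
self_loss = opd_loss(student, student, max_len=8, rng=rng)  # identical policies give zero loss
print(loss, self_loss)
```

In a real pipeline the policies would be language models and the loss would be differentiated with respect to the student parameters; only the sampling and scoring pattern is meant to carry over.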

The core distinction from off-policy distillation is distributional. Off-policy methods train on a static distribution induced by a dataset or teacher rollouts, whereas OPD trains under the student’s own state-visitation distribution. The survey literature places this shift in the framework of interactive imitation learning: off-policy supervision measures per-step error under the expert’s state distribution, while querying the teacher on learner states reduces the mismatch that otherwise causes autoregressive errors to compound (Song et al., 1 Apr 2026). This framing also explains why OPD has become a central post-training mechanism for reasoning models: it preserves dense teacher feedback while training on the states that matter at deployment.

A second, equally important formal perspective treats OPD as dense token-level RL. If the state is the current prefix and the action is the next token, then choosing a divergence is equivalent to choosing a per-token reward. Several recent works exploit this equivalence explicitly, either to analyze divergence choice or to redesign the reward used by sampled-token OPD estimators (Wang et al., 2 May 2026, Zhao et al., 15 Jun 2026).
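For the reverse-KL case, this equivalence can be written out directly. With state $c_t = (x, y_{<t})$ and action $o_t$ (notation assumed here for illustration), the negative per-token reverse KL is the expected log-ratio reward under the student:

$-\mathrm{KL}\!\left(\pi_\theta(\cdot \mid c_t)\,\big\|\,\pi_T(\cdot \mid c_t)\right) = \mathbb{E}_{o_t \sim \pi_\theta(\cdot \mid c_t)}\!\left[\log \frac{\pi_T(o_t \mid c_t)}{\pi_\theta(o_t \mid c_t)}\right],$

so minimizing the reverse KL at each visited prefix amounts to maximizing the expected per-token reward $\log \pi_T(o_t \mid c_t) - \log \pi_\theta(o_t \mid c_t)$.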

2. Objectives, divergences, and estimators

The modern OPD literature is not organized around a single loss. Forward KL, reverse KL, Jensen–Shannon divergence, skewed KL variants, sequence-level reverse KL, and PPO-style surrogates all appear as legitimate OPD instantiations (Song et al., 1 Apr 2026). Their differences are not cosmetic. In the analysis of agentic and code settings, forward KL has bounded logit gradients but is mode-covering; reverse KL is mode-seeking but can exhibit unbounded gradients when the teacher assigns near-zero mass to student-sampled tokens; and $JSD_{0.5}$ provides bounded loss and bounded logit gradients, which makes it attractive for multi-step agentic stability (Wang et al., 2 May 2026).
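A small numeric sketch makes the contrast concrete. It uses an assumed two-token vocabulary and compares loss values only (not gradients) as the teacher's mass on one student-favored token shrinks: reverse KL grows without bound, while $JSD_{0.5}$ stays below $\log 2$.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence with mixing weight 0.5."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

student = [0.5, 0.5]
results = []
for eps in (1e-2, 1e-4, 1e-8):
    teacher = [1.0 - eps, eps]  # teacher nearly rules out token 1
    results.append({
        "eps": eps,
        "forward_kl": kl(teacher, student),  # teacher-weighted
        "reverse_kl": kl(student, teacher),  # student-weighted
        "jsd": jsd(teacher, student),
    })

for row in results:
    print(row)
```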

This divergence sensitivity has practical consequences. MAD-OPD derives a task-adaptive principle: use $JSD$ for multi-step agentic tasks, where stability under privileged teacher–student gaps is decisive, and use reverse KL for code generation, where mode concentration favors coherent single-path solutions (Wang et al., 2 May 2026). By contrast, several reasoning-oriented OPD pipelines retain reverse KL because of its strong local corrective behavior on student-generated prefixes, even when that choice creates variance problems in sampled-token training (Zhang et al., 16 Feb 2026, Oh et al., 8 May 2026).

A major line of work addresses exactly those variance problems. Standard sampled-token reverse-KL OPD uses the log-ratio reward

$r_t^{\mathrm{OPD}}(c_t,o_t)=\log \frac{\pi_T(o_t \mid c_t)}{\pi_\theta(o_t \mid c_t)},$

which is unbiased as a single-sample Monte Carlo estimator but unbounded by construction. PowerOPD diagnoses the resulting pathologies—sample inefficiency, unstable generation dynamics, and a gap to exact full-vocabulary OPD—and replaces the log-ratio with a bounded, sign-consistent Box–Cox power transformation,

$r_t^\alpha = \pi_T(o_t\mid c_t)^\alpha - \pi_\theta(o_t\mid c_t)^\alpha, \qquad \alpha>0,$

with the log-ratio recovered (after rescaling by $1/\alpha$) in the degenerate $\alpha \to 0$ limit (Zhao et al., 15 Jun 2026). The same paper reports benchmark-averaged Avg@8/Pass@8 gains over vanilla OPD, over post-hoc stabilization, and over full-vocabulary OPD, together with a 59.2% wall-clock reduction and gradient norms more than 3,000x smaller than those of vanilla OPD (Zhao et al., 15 Jun 2026).
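The two rewards can be compared numerically. The sketch below implements the formulas as written above on assumed probabilities; the function names are illustrative, and dividing by $\alpha$ to exhibit the small-$\alpha$ limit is an assumption made here (it follows from $(a^\alpha - b^\alpha)/\alpha \to \log(a/b)$), not a detail taken from the paper.

```python
import math

def log_ratio_reward(p_teacher, p_student):
    """Unbounded sampled-token reward: log pi_T(o|c) - log pi_theta(o|c)."""
    return math.log(p_teacher) - math.log(p_student)

def power_reward(p_teacher, p_student, alpha):
    """Bounded reward pi_T(o|c)^alpha - pi_theta(o|c)^alpha, for alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return p_teacher ** alpha - p_student ** alpha

p_student = 0.6
rows = []
for p_teacher in (1e-12, 0.3, 0.9):
    rows.append((
        p_teacher,
        log_ratio_reward(p_teacher, p_student),
        power_reward(p_teacher, p_student, alpha=0.5),
    ))

# Small-alpha check: the power reward divided by alpha approaches the log-ratio.
small_alpha = 1e-6
rescaled = power_reward(0.3, p_student, small_alpha) / small_alpha

for row in rows:
    print(row)
print(rescaled, log_ratio_reward(0.3, p_student))
```

Because both probabilities lie in $[0, 1]$, the power reward always lies in $[-1, 1]$ and has the same sign as the log-ratio, whereas the log-ratio itself diverges as the teacher probability approaches zero.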

A second stabilization route keeps the original objective but reduces estimator variance through a control-variate baseline. vOPD casts OPD as policy-gradient RL and shows that the value function has a closed form: it is the per-token negative reverse KL between student and teacher. Subtracting this detached baseline preserves the lightweight single-sample estimator and keeps the gradient unbiased, while reducing variance without a separate critic network (Oh et al., 8 May 2026). Empirically, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline on mathematical and scientific reasoning benchmarks (Oh et al., 8 May 2026).
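The baseline idea can be illustrated at a single state. In the sketch below (assumed toy distributions, illustrative names), the baseline is the expected log-ratio reward under the student, which equals the negative reverse KL. Subtracting it leaves a zero-mean signal whose second moment is never larger than that of the raw reward; the example does not attempt to reproduce the paper's gradient-variance analysis.

```python
import math

def centered_rewards(student, teacher):
    """Return raw log-ratio rewards, the closed-form baseline, and centered rewards."""
    rewards = [math.log(t / s) for s, t in zip(student, teacher)]
    # Expected reward under the student = -KL(student || teacher); no critic needed.
    baseline = sum(s * r for s, r in zip(student, rewards))
    advantages = [r - baseline for r in rewards]
    return rewards, baseline, advantages

student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.4, 0.1]
rewards, baseline, advantages = centered_rewards(student, teacher)

mean_advantage = sum(s * a for s, a in zip(student, advantages))
raw_second_moment = sum(s * r * r for s, r in zip(student, rewards))
centered_second_moment = sum(s * a * a for s, a in zip(student, advantages))
print(baseline, mean_advantage, raw_second_moment, centered_second_moment)
```

Since the baseline depends only on the state and not on the sampled token, subtracting it does not change the expected policy gradient.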

3. Teacher access and supervision regimes

A central fault line in OPD research is teacher access. The survey literature organizes the field along three orthogonal dimensions—feedback signal, teacher access, and loss granularity—with teacher access split into white-box, black-box, and teacher-free regimes (Song et al., 1 Apr 2026). Classical logit-based OPD is a white-box method: it requires teacher logits at every generation step and often benefits from aligned tokenizers or vocabularies (Fang et al., 8 May 2026). This requirement has become increasingly restrictive as the strongest teachers are frequently proprietary APIs that expose only text.

Recent work has therefore expanded OPD beyond white-box logit matching. Rubric-based On-policy Distillation (ROPD) replaces token-level teacher logits with prompt-specific semantic rubrics induced from teacher–student contrasts. In ROPD, each rubric item contains a binary-evaluable natural-language criterion and an integer importance weight; a Verifier scores student rollouts against the rubric, and the resulting rubric scores drive GRPO updates (Fang et al., 8 May 2026). In the white-box comparison reported in that work, ROPD, despite using teacher text only, achieves an average of 45.87 over AIME24/25 and HMMT Feb/Nov versus 35.25 for ExOPD and 32.82 for LOPD, closing 74.1% of the student–teacher gap versus 42.1% for LOPD, or roughly 1.8× better gap closing (Fang et al., 8 May 2026).
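A rough sketch of how weighted binary rubric verdicts could become a scalar reward and then a group-relative advantage is given below. The `RubricItem` class, the weighted-fraction score, and the mean/standard-deviation normalization are assumptions chosen to match the description above and the usual GRPO recipe; the paper's exact scoring rule may differ.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class RubricItem:
    criterion: str   # binary-evaluable natural-language criterion
    weight: int      # integer importance weight

def rubric_score(items, verdicts):
    """Weighted fraction of rubric items a rollout satisfies (assumed scoring rule)."""
    total = sum(item.weight for item in items)
    earned = sum(item.weight for item, ok in zip(items, verdicts) if ok)
    return earned / total

def group_advantages(scores, eps=1e-6):
    """Group-relative advantages: center and scale scores within one prompt's rollouts."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

rubric = [
    RubricItem("States the invariant before using it", 3),
    RubricItem("Checks the boundary case", 2),
    RubricItem("Gives the final answer in closed form", 1),
]
# Hypothetical verifier verdicts for four student rollouts on the same prompt.
verdicts = [
    [True, True, True],
    [True, False, False],
    [False, True, True],
    [False, False, False],
]
scores = [rubric_score(rubric, v) for v in verdicts]
advantages = group_advantages(scores)
print(scores, advantages)
```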

OmniOPD pursues a different black-box route. It groups tokens into chunks, queries the teacher with Monte Carlo rollouts on selected high-entropy student prefixes, scores semantic agreement with a chunk-level similarity metric, smooths the estimate with a Dirichlet-Multinomial prior, and constrains un-audited tokens with a base-model KL anchor (Zhou et al., 31 May 2026). This design is explicitly tokenizer-invariant and text-only. On math, OmniOPD surpasses the standard OPD approach by up to +28.64%, and with black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash it achieves an additional +9.54% relative improvement over its open-weight teacher counterpart (Zhou et al., 31 May 2026).
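Two of the ingredients named above, high-entropy prefix selection and Dirichlet-Multinomial smoothing, can be sketched in a few lines. The function names, the audit budget, and the symmetric prior are illustrative assumptions; the paper's chunking, similarity metric, and KL anchor are not modeled here.

```python
import math

def entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_high_entropy(prefix_dists, budget):
    """Pick the indices of the `budget` most uncertain student prefixes to audit."""
    ranked = sorted(range(len(prefix_dists)),
                    key=lambda i: entropy(prefix_dists[i]), reverse=True)
    return sorted(ranked[:budget])

def smoothed_agreement(counts, prior=1.0):
    """Posterior mean of a Dirichlet-Multinomial with a symmetric prior."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]

# Student next-token distributions at three prefixes (toy values).
dists = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
audited = select_high_entropy(dists, budget=1)

# Three teacher rollouts agreed with the student chunk, none disagreed.
estimate = smoothed_agreement([3, 0], prior=1.0)
print(audited, estimate)
```

With only three rollouts, the smoothed estimate avoids assigning probability exactly 1 to agreement, which is the practical reason for a prior on small Monte Carlo samples.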

Other black-box-compatible variants modify the teacher signal more locally. BRTS samples a small pool of teacher trajectories and selects a teacher-context supervision branch by the rule “correctness first, student alignment second,” with a ground-truth-conditioned recovery step when unconditioned teacher samples fail (Zhang et al., 10 May 2026). On AIME24, AIME25, and AMC23, this improves over standard OPD, with the largest gains on harder datasets (Zhang et al., 10 May 2026). The broader implication is that OPD is no longer equivalent to white-box next-token imitation; it now includes rubric induction, semantic verification, rollout selection, discriminator-based rewards, verbal scores, and self-distillation with privileged context (Song et al., 1 Apr 2026, Lazaridis et al., 22 May 2026).
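The selection rule "correctness first, student alignment second" amounts to a lexicographic comparison. The sketch below assumes each candidate teacher trajectory carries a correctness flag and a student log-probability as the alignment measure; the field names and the `recover` callback are hypothetical stand-ins for the ground-truth-conditioned recovery step.

```python
def select_branch(candidates, recover=None):
    """Pick a supervision branch: correct trajectories first, then highest student log-prob."""
    if not any(c["correct"] for c in candidates) and recover is not None:
        # No unconditioned sample is correct: fall back to a recovered trajectory.
        candidates = candidates + [recover()]
    return max(candidates, key=lambda c: (c["correct"], c["student_logprob"]))

pool = [
    {"id": "a", "correct": False, "student_logprob": -1.0},
    {"id": "b", "correct": True, "student_logprob": -9.0},
    {"id": "c", "correct": True, "student_logprob": -4.0},
]
chosen = select_branch(pool)

failing_pool = [{"id": "x", "correct": False, "student_logprob": -2.0}]
recovered = select_branch(
    failing_pool,
    recover=lambda: {"id": "r", "correct": True, "student_logprob": -20.0},
)
print(chosen["id"], recovered["id"])
```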

A compact way to view these regimes is as follows.

Regime	Representative papers	Characteristic signal
White-box logit OPD	(Wang et al., 2 May 2026, Oh et al., 8 May 2026, Zhao et al., 15 Jun 2026)	Token-level teacher distributions or log-ratio rewards
Black-box text-only OPD	(Fang et al., 8 May 2026, Zhou et al., 31 May 2026, Zhang et al., 10 May 2026)	Rubrics, semantic chunk verification, selected teacher trajectories
Self-distillation with privileged context	(Lazaridis et al., 22 May 2026)	Same model as teacher with added privileged context

This taxonomy also corrects a common misconception. OPD is often identified with logit access, but current work shows that the on-policy aspect is logically separate from the supervision modality. A plausible implication is that the defining feature of OPD is not teacher logits per se, but student-state-conditioned feedback.

4. Long-horizon reasoning, position bias, and efficient supervision

Long-horizon reasoning exposes a persistent weakness in OPD: teacher supervision quality degrades with position as student trajectories drift from the teacher’s preferred continuation manifold. Several papers isolate this effect from different angles. “On the Position Bias of On-Policy Distillation” reports that OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything (Xie et al., 21 Jun 2026). The same work formalizes token importance through constrained optimization and derives Importance-Weighted OPD (IW-OPD), where the token weight depends on accumulated teacher–student discrepancy, naturally upweighting earlier tokens and downweighting later ones. In both same-size and cross-scale settings, IW-OPD converges faster and improves performance by up to 6.9 points on AIME-2025 (Xie et al., 21 Jun 2026).
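One way to picture a weight that depends on accumulated teacher–student discrepancy is an exponential decay in the running sum of per-token discrepancies. This functional form, the `beta` parameter, and the mean-one normalization are assumptions for illustration only; the paper derives its weights from a constrained optimization that is not reproduced here.

```python
import math

def importance_weights(discrepancies, beta=1.0):
    """Weight token t by exp(-beta * sum of discrepancies before t), normalized to mean 1."""
    weights, cumulative = [], 0.0
    for d in discrepancies:
        weights.append(math.exp(-beta * cumulative))
        cumulative += d  # nonnegative per-token discrepancy, e.g. a per-token KL
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]

per_token_discrepancy = [0.0, 0.5, 0.5, 1.0]
weights = importance_weights(per_token_discrepancy)
print(weights)
```

Because the accumulated discrepancy can only grow along the rollout, the weights are non-increasing in position, which matches the qualitative behavior described above.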

A closely related efficiency argument appears in “Fast and Effective On-policy Distillation from Reasoning Prefixes.” That paper observes that during OPD training, signal is often concentrated in reasoning prefixes and that even a short teacher-generated prefix can significantly help the student produce the correct answer. Its prefix-only variant applies the distillation objective only to prefixes of student-generated outputs and terminates sampling early during distillation, matching the performance of full OPD while reducing training FLOP by 2x-47x (Zhang et al., 16 Feb 2026). This suggests that a large fraction of OPD’s cost is spent supervising low-value suffix tokens.

Prune-OPD makes that observation dynamic rather than static. It monitors local student–teacher compatibility, for example through top-$k$ overlap, detects prefix-drift events online, monotonically down-weights unreliable rewards, and triggers dynamic rollout truncation. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%–68.0% while preserving, and often improving, performance on AMC, AIME, and HMMT; when compatibility remains high, it expands the training window to preserve long-context supervision (Yang et al., 8 May 2026).
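The monitoring loop can be sketched as follows. The overlap measure, the threshold, the multiplicative decay, and the truncation floor are all hypothetical choices made to produce a runnable example; they are not the paper's settings.

```python
def top_indices(probs, k):
    """Indices of the k most probable tokens."""
    return set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])

def top_k_overlap(student_probs, teacher_probs, k):
    """Fraction of the student's top-k tokens that are also in the teacher's top-k."""
    return len(top_indices(student_probs, k) & top_indices(teacher_probs, k)) / k

def prune_weights(overlaps, threshold=0.5, decay=0.5, floor=0.1):
    """Down-weight rewards after each drift event; truncate once the weight falls below floor."""
    weights, w = [], 1.0
    for position, overlap in enumerate(overlaps):
        if overlap < threshold:   # drift event: teacher and student disagree locally
            w *= decay            # the weight only ever decreases (monotone)
        if w < floor:
            return weights, position  # truncate the rollout here
        weights.append(w)
    return weights, len(overlaps)

overlap = top_k_overlap([0.5, 0.3, 0.2], [0.1, 0.6, 0.3], k=2)
weights, cut = prune_weights([1.0, 1.0, 0.4, 0.2, 0.0, 0.0])
print(overlap, weights, cut)
```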

These works converge on a shared diagnosis: not all positions in an on-policy rollout are equally informative. This diagnosis also helps explain why OPD stabilization has increasingly focused on reward design, variance control, and adaptive truncation rather than only on stronger teachers. A plausible implication is that long-horizon OPD is less a problem of obtaining denser feedback than of allocating feedback to positions where teacher information remains locally exploitable.

5. Multi-turn agents, privileged context, and asynchronous systems

The original OPD literature concentrated on single-turn reasoning, but current work extends the paradigm to agentic environments, privileged-context self-distillation, and asynchronous pipelines. In multi-turn agents, early mistakes change future observations, so dense token-level imitation can propagate unreliable teacher supervision on corrupted histories. SAGE-OPD addresses this with a verifier-free selective intervention framework: the teacher decides whether each student turn should be skipped, weakly intervened on, or strongly intervened on; token-level distillation is then weighted by teacher confidence and normalized to preserve the overall loss scale. On agent tasks, SAGE-OPD achieves up to a 13.3% relative improvement in ALFWorld unseen success rate over standard OPD (Zhou et al., 17 Jun 2026).
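The selective weighting can be sketched at the turn level. The numeric strengths attached to each intervention level and the mean-one normalization are assumptions used to illustrate "weighted by teacher confidence and normalized to preserve the overall loss scale"; the paper's exact scheme may differ.

```python
# Hypothetical strengths for the three teacher decisions.
LEVELS = {"skip": 0.0, "weak": 0.5, "strong": 1.0}

def turn_weights(decisions, confidences):
    """Per-turn distillation weights, rescaled so their mean is 1 when any turn is kept."""
    raw = [LEVELS[d] * c for d, c in zip(decisions, confidences)]
    total = sum(raw)
    if total == 0.0:
        return raw  # every turn skipped: no distillation signal for this rollout
    scale = len(raw) / total
    return [r * scale for r in raw]

weights = turn_weights(["skip", "weak", "strong"], [0.9, 0.8, 0.6])
print(weights)
```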

Guided-OPD takes a complementary approach. Instead of selectively weighting losses after the fact, it mixes teacher- and student-generated turns inside each rollout and decays the teacher intervention probability to zero over training. Early teacher turns keep trajectories near the teacher distribution; late training returns to the purely on-policy regime used at inference (Li et al., 14 Jun 2026). On ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1% and Success Rate by 25.5% over vanilla OPD on average, with larger gains on smaller students (Li et al., 14 Jun 2026).
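A minimal sketch of the mixing schedule follows. The linear decay, the initial probability `p0`, and the per-turn Bernoulli draw are assumptions; the source only states that the intervention probability decays to zero over training.

```python
import random

def teacher_probability(step, total_steps, p0=0.5):
    """Probability that a turn is generated by the teacher; decays linearly to zero."""
    return max(0.0, p0 * (1.0 - step / total_steps))

def rollout_sources(num_turns, step, total_steps, rng):
    """Decide, turn by turn, whether the teacher or the student generates the turn."""
    p = teacher_probability(step, total_steps)
    return ["teacher" if rng.random() < p else "student" for _ in range(num_turns)]

rng = random.Random(0)
early = rollout_sources(num_turns=6, step=0, total_steps=100, rng=rng)
late = rollout_sources(num_turns=6, step=100, total_steps=100, rng=rng)
print(early, late)
```

At the final step the teacher probability is exactly zero, so every turn comes from the student and the rollout is purely on-policy.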

MAD-OPD addresses a different bottleneck: the single-teacher capability ceiling. It replaces the teacher with a debating collective of teachers, weights their token-level supervision by post-debate confidence, and extends OPD to agentic settings through On-Policy Agentic Distillation (OPAD), which performs step-level sampling to stabilize long trajectories (Wang et al., 2 May 2026). Across six teacher–student configurations and five agentic and code benchmarks, MAD-OPD ranks first in all six; in the 14B+8B $\to$ 4B setting it lifts both the agentic average and the code average over the stronger single-teacher OPD (Wang et al., 2 May 2026).
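One simple reading of confidence-weighted multi-teacher supervision is a convex mixture of the teachers' next-token distributions. This is an illustrative assumption: the paper may instead weight per-teacher losses, and the confidence values below are made up.

```python
def mix_teachers(distributions, confidences):
    """Confidence-weighted mixture of several teachers' next-token distributions."""
    total = sum(confidences)
    weights = [c / total for c in confidences]
    vocab_size = len(distributions[0])
    return [sum(w * dist[i] for w, dist in zip(weights, distributions))
            for i in range(vocab_size)]

teacher_a = [0.8, 0.2]
teacher_b = [0.4, 0.6]
# Hypothetical post-debate confidences: teacher A is trusted three times as much as B.
target = mix_teachers([teacher_a, teacher_b], confidences=[3.0, 1.0])
print(target)
```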

Privileged-context self-distillation introduces yet another variant. EDGE-OPD studies On-Policy Self-Distillation (OPSD), where the same base model acts as teacher and student but the teacher receives extra training-time context such as a persona, a private fact, or a worked solution (Lazaridis et al., 22 May 2026). The paper shows that unguided OPSD and RLSD-no-verifier completely fail to learn a rare identity, whereas guided rollouts make identity transfer possible; it then adds an evidence mask so that updates are applied only where the privileged context supports the sampled token (Lazaridis et al., 22 May 2026). The results are mixed across task types: positive-evidence masking localizes the persona signal effectively in the identity setting but is harmful on the math axis, where near-zero evidence or verifier-based methods are safer (Lazaridis et al., 22 May 2026).

Asynchronous execution adds a systems-level complication. AsyncOPD studies stale-policy data in asynchronous OPD and shows that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable (Kang et al., 23 Jun 2026). For the reverse-KL case, methods borrowed from asynchronous RL do not improve over a simpler OPD-specific surrogate (recomputing the reverse-KL signal under the current student at learner time), and finite teacher-score caches induce a bias–variance tradeoff that motivates multi-sample Monte Carlo estimators (Kang et al., 23 Jun 2026). The resulting asynchronous pipeline improves training throughput over strict synchronous training while reaching comparable accuracy (Kang et al., 23 Jun 2026). A related freshness-aware asynchronous OPD system decomposes asynchronous objective discrepancy into rollout drift and supervision drift, then adaptively weights stale samples and refreshes the buffer; it achieves task performance comparable to synchronous optimization while largely retaining asynchronous throughput advantages (Chen et al., 18 May 2026).
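The recomputation idea for the reverse-KL case can be sketched as follows. The buffer layout, the field names, and the `current_logprob` stand-in for a forward pass of the up-to-date student are all hypothetical; the sketch only contrasts a signal computed with the stale behavior policy against one recomputed at learner time.

```python
import math

def reverse_kl_signal(teacher_logprob, student_logprob):
    """Sampled-token reverse-KL signal: log pi_T(o|c) - log pi_student(o|c)."""
    return teacher_logprob - student_logprob

def refresh_signals(batch, current_logprob_fn):
    """Recompute the signal under the current student instead of the stale behavior policy."""
    return [
        reverse_kl_signal(s["teacher_logprob"], current_logprob_fn(s["prefix"], s["token"]))
        for s in batch
    ]

# One buffered sample generated by an older student checkpoint (toy values).
batch = [{
    "prefix": (1, 2),
    "token": 3,
    "teacher_logprob": math.log(0.5),
    "behavior_logprob": math.log(0.1),  # log-prob under the stale student
}]

def current_logprob(prefix, token):
    """Hypothetical stand-in for scoring the token with the current student."""
    return math.log(0.4)

stale = [reverse_kl_signal(s["teacher_logprob"], s["behavior_logprob"]) for s in batch]
fresh = refresh_signals(batch, current_logprob)
print(stale, fresh)
```

In this toy case the student has already moved toward the teacher, so the refreshed signal is much smaller than the stale one, which would otherwise over-correct.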

6. Empirical landscape, limitations, and open directions

The contemporary OPD landscape is broad enough that a single canonical recipe no longer exists. The survey literature organizes the space by feedback signal—logit-based, outcome-based, or self-play—by teacher access—white-box, black-box, or teacher-free—and by loss granularity—token-level, sequence-level, or hybrid (Song et al., 1 Apr 2026). Within that space, OPD is already central to industrial and open-source post-training pipelines: recent open-source frontier models such as Qwen3 and DeepSeek-v4 use OPD as a central alignment technique, and industrial deployments discussed in the survey include Qwen3, Gemma 2, MiMo-V2, and Nemotron-Cascade 2 (Fang et al., 8 May 2026, Song et al., 1 Apr 2026).

The empirical record is correspondingly heterogeneous. ROPD reports up to a 10x gain in sample efficiency over advanced logit-based OPD methods (Fang et al., 8 May 2026). OmniOPD shows that chunk-level semantic verification can outperform white-box token-level OPD in math, while remaining compatible with proprietary teachers (Zhou et al., 31 May 2026). MAD-OPD demonstrates that multi-teacher deliberation can lift both agentic and code OPD (Wang et al., 2 May 2026). PowerOPD and vOPD show that much of vanilla OPD’s instability comes from the sampled-token estimator rather than from the on-policy principle itself (Zhao et al., 15 Jun 2026, Oh et al., 8 May 2026). Collectively, these results suggest that the performance ceiling of OPD is not fixed by a single divergence or teacher interface.

At the same time, several limitations recur. Experiments remain concentrated in formal reasoning domains—math, science, medicine, code, and text-based agents—while evidence on subjective or creative tasks is still limited (Fang et al., 8 May 2026, Lazaridis et al., 22 May 2026). Black-box methods depend on the quality of rubricators, verifiers, or semantic similarity metrics, and broader validation across judge models is still needed (Fang et al., 8 May 2026, Zhou et al., 31 May 2026). Self-distillation with privileged context raises explicit misuse concerns because the same mechanisms that internalize benign personas could inject covert identities or behaviors (Lazaridis et al., 22 May 2026). And asynchronous OPD, although increasingly necessary at scale, introduces nontrivial rollout and supervision drift that current theory characterizes more readily than it fully resolves (Chen et al., 18 May 2026, Kang et al., 23 Jun 2026).

The survey identifies several open problems that now define the research frontier: distillation scaling laws, uncertainty-aware feedback, dynamic curricula, latent-space and cross-architecture distillation, agent-level OPD, multimodal OPD, and principled combinations of OPD with RL (Song et al., 1 Apr 2026). A plausible synthesis of the recent literature is that OPD is moving away from a narrow interpretation—student rollouts plus white-box token-level KL—and toward a more general family of student-state-conditioned alignment methods. In that broader sense, OPD has become a unifying framework for distillation under deployment-time state distributions, whether the supervisory signal is logits, rubrics, semantic chunk agreement, debate-conditioned teacher ensembles, or privileged-context evidence.