
On-Policy Knowledge Distillation

Updated 9 April 2026
  • On-policy knowledge distillation is a learning paradigm where a student model generates its own trajectories and receives direct corrective feedback from a teacher model.
  • It effectively reduces exposure bias by aligning the student’s predictions with teacher signals during self-generated rollouts, improving multi-step reasoning and domain adaptation.
  • Methodological variants, such as reverse-KL, forward-KL, and hybrid strategies, offer scalable solutions balancing computational efficiency with robust performance.

On-policy knowledge distillation is a class of learning algorithms wherein a student model is trained by generating its own rollouts (trajectories or sequences) and receiving direct, corrective feedback from a teacher model or group of policies specifically on those self-generated outputs. This paradigm is central for reducing exposure bias, enhancing distributional robustness, and efficiently transferring complex behaviors—such as reasoning, domain adaptation, and long-horizon planning—from large or high-performing teachers to more compact student architectures. The theory and practice of on-policy distillation have advanced rapidly, producing a diverse taxonomy of approaches that differ in signal granularity, teacher accessibility, optimization protocol, and theoretical underpinnings (Song et al., 1 Apr 2026).

1. Foundational Concepts and Motivation

Off-policy distillation constrains a student to minimize a divergence (typically forward KL or cross-entropy) between its outputs and precomputed teacher trajectories under the data distribution $d_\mathcal{D}(s)$ determined by teacher-forced or supervised data. However, this induces a train–test mismatch: at inference the student generates tokens autoregressively, traversing states drawn from $d_{\pi_\theta}(s)$ that were never visited during training. The resulting error accumulation, termed exposure bias, degrades performance on multi-step reasoning, code, and interactive tasks.

On-policy knowledge distillation (OPD) addresses this by sampling states $s$ from the student’s own policy, $d_{\pi_\theta}(s)$, and aligning its predictions to teacher feedback directly on these self-generated prefixes (Song et al., 1 Apr 2026, Ye et al., 12 Feb 2026). This connection to interactive imitation learning (notably DAgger) ensures that the distillation loss is minimized where the student actually operates at test time, reducing compounding errors from $O(\epsilon T^2)$ to $O(\epsilon T)$ for per-step error $\epsilon$ and output horizon $T$. This foundational principle underlies the design of most modern OPD algorithms.
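The DAgger-style principle can be sketched in a toy setting: the student rolls out its own policy, and each state it actually visits is pulled toward the teacher's distribution. Everything here is illustrative; the table-based `teacher_policy` and `student` dictionaries stand in for real models, and the moving-average update stands in for a gradient step.

```python
import random

random.seed(0)

VOCAB = [0, 1]
START = -1  # sentinel state before any token is generated

def teacher_policy(state):
    """Hypothetical teacher: strongly prefers alternating tokens."""
    p1 = 0.9 if state in (START, 0) else 0.1  # after a 1, prefer 0
    return [1.0 - p1, p1]

# Student "parameters": one distribution per state, initialised uniform.
student = {START: [0.5, 0.5], 0: [0.5, 0.5], 1: [0.5, 0.5]}

def on_policy_distill_step(horizon=6, lr=0.2):
    """Roll out the STUDENT's own policy; at each visited state, nudge the
    student's distribution toward the teacher's."""
    state = START
    for _ in range(horizon):
        token = random.choices(VOCAB, weights=student[state])[0]
        target = teacher_policy(state)  # teacher feedback on the student's state
        student[state] = [(1 - lr) * s + lr * t
                          for s, t in zip(student[state], target)]
        state = token  # the next state is set by the student's own sample

for _ in range(300):
    on_policy_distill_step()

# After training, the student matches the teacher on the states it visits.
print({k: [round(p, 2) for p in v] for k, v in student.items()})
```

Because the rollout is driven by the student, supervision concentrates exactly on the state distribution the student induces at test time, which is the mechanism behind the $O(\epsilon T)$ bound above.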

2. Unified Mathematical Frameworks

At its most general, on-policy distillation can be expressed as minimizing an ff-divergence between teacher and student at each state, sampled from student-driven or mixed policies:

$$\mathcal{L}_{\mathrm{OPD}}(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{mix}}} \left[ \sum_{t=1}^{|y|} D_f\big(p_T(\cdot \mid s_t),\, p_\theta(\cdot \mid s_t)\big) \right]$$

where $D_f$ is an $f$-divergence (e.g., KL, reverse KL, JSD), $p_T$ and $p_\theta$ are the teacher and student distributions, and $\pi_{\mathrm{mix}}$ interpolates between student and (optionally) teacher trajectories (Song et al., 1 Apr 2026).
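As a concrete (toy) instance of this objective, the sketch below sums a per-state divergence over one sampled trajectory; the distributions are made-up numbers, and $D_f$ is instantiated as forward KL (swapping the argument order gives reverse KL):

```python
import math

def kl(p, q):
    """Forward KL divergence D_KL(p || q) between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def opd_loss(teacher_steps, student_steps, divergence=kl):
    """Single-rollout estimate of the objective: sum a per-state divergence
    over one student-sampled trajectory."""
    return sum(divergence(p_t, p_s)
               for p_t, p_s in zip(teacher_steps, student_steps))

# Per-token teacher/student distributions along a length-3 rollout
# (made-up numbers, not from any real model).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
student = [[0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.2, 0.2, 0.6]]

forward_kl = opd_loss(teacher, student)  # mode-covering
reverse_kl = opd_loss(student, teacher)  # mode-seeking: D_KL(student || teacher)
print(forward_kl, reverse_kl)
```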

Variants include:

  • Reverse-KL OPD (mode-seeking): Students are encouraged to concentrate probability on teacher-preferred tokens along their own rollouts (Ye et al., 12 Feb 2026).
  • Forward-KL OPD (mode-covering): Students are penalized for failing to cover the support of the teacher, but may suffer from zero-avoidance pathologies in sparse or OOD regions (Jin et al., 7 Mar 2026).
  • Hybrid or adaptive objectives: Methods such as Entropy-Aware OPD swap between reverse- and forward-KL depending on teacher output entropy, dynamically preserving diversity without sacrificing imitation precision (Jin et al., 7 Mar 2026).
  • Geometric bridges and adaptive targets: Techniques like Veto interpolate teacher and student distributions in logit space via an interpolation parameter, stabilizing gradients and trading off entropy against reward (Jang et al., 12 Jan 2026).
  • KL-constrained RL reformulations: On-policy distillation can be posed as a special case of dense KL-constrained RL with an implicit reward function determined by teacher log-likelihoods. This connection underpins Generalized OPD frameworks, allowing for reward weighting and reward extrapolation beyond the teacher's support (Yang et al., 12 Feb 2026).
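A minimal sketch of the hybrid idea, assuming a simple entropy threshold as the gating rule (the actual EOPD criterion may differ):

```python
import math

def entropy(p):
    """Shannon entropy of a discrete distribution (nats)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl(p, q):
    """Forward KL divergence D_KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def adaptive_divergence(p_teacher, p_student, tau=0.8):
    """Entropy-gated loss: mode-covering forward KL where the teacher is
    uncertain, mode-seeking reverse KL where it is confident. The gate
    and threshold tau are illustrative choices."""
    if entropy(p_teacher) > tau:
        return kl(p_teacher, p_student)  # cover the teacher's support
    return kl(p_student, p_teacher)      # lock onto the teacher's mode

confident = [0.9, 0.05, 0.05]  # low-entropy teacher step -> reverse KL
uncertain = [0.4, 0.3, 0.3]    # high-entropy teacher step -> forward KL
student_p = [0.6, 0.2, 0.2]
print(adaptive_divergence(confident, student_p),
      adaptive_divergence(uncertain, student_p))
```

The gate preserves diversity where the teacher itself is diverse, while enforcing sharp imitation where the teacher is decisive.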

Optimization strategies depend on the differentiability and directionality of $D_f$, and often combine policy-gradient or PPO-style surrogates for efficient credit assignment at the sequence and token levels.

3. Principal Methodological Variants

Recent literature classifies OPD methods along three axes (Song et al., 1 Apr 2026):

A. Feedback Signal

  • Logit- or advantage-based: dense, per-token signals derived from teacher distributions or value estimates.
  • Preference or contrastive: comparative judgments over sampled outputs.
  • Verbal or score-based: trajectory-level critiques or scalar ratings.

B. Teacher Access

  • White-box: Direct logit or model state access for exact per-token divergences.
  • Black-box: Only sampled teacher outputs; may require adversarial or preference modeling to synthesize feedback (e.g., GAD (Ye et al., 13 Nov 2025), SODA (Chen et al., 4 Apr 2026), OVD (Xiong et al., 29 Jan 2026)).
  • Teacher-free (self-distillation): The model adapts by distilling from its own checkpoints or from privileged "teacher" prompts.
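In the black-box setting, one common primitive is a preference-style loss over sampled outputs. The sketch below is a generic Bradley–Terry-style contrastive loss in the spirit of these methods, not the exact objective of GAD or SODA; the log-probability inputs are hypothetical values:

```python
import math

def contrastive_blackbox_loss(student_lp_teacher_sample, student_lp_own_sample):
    """With only sampled teacher outputs (no logits), train the student to
    assign higher likelihood to the teacher's sample than to its own via a
    logistic loss on the log-probability margin."""
    margin = student_lp_teacher_sample - student_lp_own_sample
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Student currently prefers its own output (-8 > -12): high loss.
bad = contrastive_blackbox_loss(-12.0, -8.0)
# Student prefers the teacher's output: low loss.
good = contrastive_blackbox_loss(-8.0, -12.0)
print(bad > good)  # True
```

Only sequence log-probabilities under the student are needed, which is why such losses remain usable when the teacher exposes nothing but sampled text.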

C. Granularity

  • Token-level: Each token in the student trajectory receives dense supervision, maximizing local alignment (canonical in LLM compression) (Ye et al., 12 Feb 2026).
  • Sequence-level: Distillation loss is computed on entire output sequences, typically advantageous in reward-guided settings or when only global feedback is available (Jia, 25 May 2025).
  • Hybrid/adaptive: Mixed token- and sequence-objectives, prefix-truncated variants, and sliding-window or adaptive token selection to accelerate convergence and lower compute (Zhang et al., 16 Feb 2026, Peng et al., 9 Oct 2025).
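The granularity distinction can be made concrete with two toy loss shapes: a dense per-token average versus a REINFORCE-style sequence-level weight (illustrative only; `teacher_score` is a hypothetical scalar, not any paper's exact formulation):

```python
def token_level_loss(per_token_divergences):
    """Dense signal: every student-generated token carries its own teacher
    divergence term; the loss is their mean."""
    return sum(per_token_divergences) / len(per_token_divergences)

def sequence_level_loss(student_seq_logprob, teacher_score):
    """Sparse signal: a single scalar for the whole output (e.g. a teacher
    preference or score) weights the sequence log-probability,
    REINFORCE-style."""
    return -teacher_score * student_seq_logprob

# Illustrative numbers: three per-token KL terms vs one sequence score.
dense = token_level_loss([0.09, 0.23, 0.31])
sparse = sequence_level_loss(student_seq_logprob=-14.2, teacher_score=0.8)
print(dense, sparse)
```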

Representative methods and their properties are summarized below:

| Method | Feedback | Teacher Access | Granularity |
|---|---|---|---|
| OPCD (Ye et al., 12 Feb 2026) | Logit-based | White-box | Token-level |
| EOPD (Jin et al., 7 Mar 2026) | Logit, adaptive | White-box | Token-level switch |
| RLAD (Zhang et al., 26 Feb 2026) | Advantage-based | White-box | Token + sequence |
| GAD (Ye et al., 13 Nov 2025) | Preference | Black-box | Sequence-level |
| OVD (Xiong et al., 29 Jan 2026) | Verbal/score | Black-box | Trajectory-level |
| SODA (Chen et al., 4 Apr 2026) | Contrastive | Black-box | Sequence-level |
| Fast-OPD (Zhang et al., 16 Feb 2026) | Logit/prefix | White-box | Token-level prefix |
| Dual Policy (Lai et al., 2020) | Peer logit | Teacher-free | Token/actor-critic |

4. Key Empirical Findings and Applications

On-policy knowledge distillation delivers consistent gains in multi-step reasoning, code synthesis, domain adaptation, and privacy-constrained settings:

  • Superior accuracy and retention: Across LLM benchmarks, OPD variants outperform off-policy and context-distillation baselines on both in- and out-of-domain tasks (Ye et al., 12 Feb 2026). For example, OPCD achieves 79.7% math accuracy versus 78.5% for context distillation, and generalizes better out of domain (81.7% versus 81.2% on IF-Eval OOD) (Ye et al., 12 Feb 2026).
  • Computational efficiency via prefix truncation: Distilling only student-generated prefixes (e.g. first 2048 tokens) instead of full sequences matches full OPD accuracy while reducing training FLOPs by up to 47× (Zhang et al., 16 Feb 2026).
  • Cross-size and cross-domain transfer: OPD reliably transfers solution traces or system prompt knowledge from larger (or multiple) teachers to smaller students, often yielding performance improvements beyond both source models when reward extrapolation is enabled (Yang et al., 12 Feb 2026).
  • Adaptive and robust exploration: Methods such as Veto (Jang et al., 12 Jan 2026), EOPD (Jin et al., 7 Mar 2026), and AdaSwitch (Peng et al., 9 Oct 2025) stabilize learning by dynamically interpolating between mode-seeking and mode-covering objectives or adaptively switching between student-driven and teacher-driven tokens. These approaches are effective in mitigating diversity collapse and optimizing sample efficiency.
  • Black-box and privacy settings: Adversarial (GAD (Ye et al., 13 Nov 2025), SODA (Chen et al., 4 Apr 2026)) and trajectory-level verbal feedback (OVD (Xiong et al., 29 Jan 2026)) enable on-policy distillation when teacher logits are unavailable, maintaining effectiveness with much lower peak memory or under strict privacy constraints (DP-OPD (Khadem et al., 6 Apr 2026)).
  • Reinforcement-aware extensions: Selective imitation based on advantage or dynamic trust region mixtures (RLAD (Zhang et al., 26 Feb 2026)) enable OPD to coexist with reward maximization, resolving traditional KL–RL interference and boosting performance in chain-of-thought reasoning and long-horizon tasks.
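The prefix-truncation idea reduces teacher compute by scoring only the first K tokens of each rollout; a minimal sketch of the mechanism (not the Fast-OPD implementation):

```python
def truncate_rollouts(rollouts, max_prefix=2048):
    """Keep only the first max_prefix tokens of each student rollout before
    querying the teacher; teacher-forward FLOPs drop roughly in proportion
    to the tokens removed."""
    return [r[:max_prefix] for r in rollouts]

# Token-id rollouts of length 6000 and 1500 (dummy data).
rollouts = [list(range(6000)), list(range(1500))]
truncated = truncate_rollouts(rollouts)
print([len(r) for r in truncated])  # [2048, 1500]
```

The savings are largest for long-generation workloads, where most of a rollout's tail contributes little distinct supervision relative to its cost.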

5. Limitations, Practical Challenges, and Theoretical Insights

While OPD closes key performance gaps, it also introduces new complexities:

  • Compute and memory overheads: Full teacher forward passes at every student token are generally 3–8× more expensive than off-policy distillation; mitigations include prefix truncation, caching, and quantization (Zhang et al., 16 Feb 2026, Song et al., 1 Apr 2026).
  • Stability and curriculum: OPD can destabilize when student and teacher distributions diverge early or are misaligned in capacity. Curriculum schedules, dynamic divergence adaptation (as in Veto), and hybrid off-/on-policy mixing are common remedies (Jang et al., 12 Jan 2026, Peng et al., 9 Oct 2025).
  • Feedback informativeness: Mode-seeking reverse KL collapses diversity where teachers are uncertain; adaptive loss switching (EOPD) or analytic target reformulation (Veto) are needed for robust generalization (Jin et al., 7 Mar 2026).
  • Teacher quality and calibration: OOD or miscalibrated teachers can propagate error or induce hallucination. Uncertainty-aware and reward-extrapolating methods offer partial solutions but full theoretical characterization remains open (Yang et al., 12 Feb 2026, Song et al., 1 Apr 2026).
  • Hybrid and agent-centric extensions: Modern deployments demand OPD for tool-using or multi-agent systems, requiring new frameworks for counterfactual feedback and dynamic curriculum generation (Song et al., 1 Apr 2026, Zhang et al., 26 Feb 2026).

6. Industrial Deployments and Open Research Questions

On-policy distillation forms the backbone of robust, high-stakes LLM deployments:

  • Qwen3, Gemma 2, and Nemotron-Cascade 2 employ dynamic teacher-student ensembles with massive on-policy curricula for mathematical, reasoning, and instructional domains (Song et al., 1 Apr 2026).
  • Speculative and constrained KD protocols (e.g., DistillSpec, Path-Consistency Learning) enable policies to request teacher intervention adaptively at test time within budgeted constraints, dominating the latency–accuracy Pareto front in controlled experiments (Liu et al., 24 Feb 2025, Peng et al., 9 Oct 2025).

Despite progress, several open challenges remain (Song et al., 1 Apr 2026):

  • Absence of scaling laws relating teacher/student size and OPD data requirements.
  • Lack of principled methods for decomposing and leveraging teacher uncertainty.
  • Need for curriculum and latent-space distillation frameworks able to handle tokenizer mismatch and multi-modal settings.
  • Necessity for rigorous evaluation on distribution-shifted and adversarial benchmarks, beyond standard held-out datasets.
  • The theoretical landscape for alternating or hybrid RL–OPD optimization remains under-explored.

7. Conclusion

On-policy knowledge distillation is a theoretically founded and empirically validated approach for overcoming train–test mismatch, exposure bias, and inefficiencies of standard off-policy supervised distillation. It is instantiated across a broad methodological spectrum—from token-level reverse KL minimization to trajectory-level preference-based learning and peer-to-peer co-distillation—and realized in white-box, black-box, and privacy-limited environments. As industrial-scale LLMs and agentic systems demand more robust, generalizable, and efficient compression, OPD is emerging as a critical paradigm. Key research frontiers include curriculum and uncertainty modeling, effective scaling, and integrated RL–KD design (Song et al., 1 Apr 2026, Ye et al., 12 Feb 2026, Jin et al., 7 Mar 2026, Jang et al., 12 Jan 2026, Ye et al., 13 Nov 2025, Jia, 25 May 2025).
