Generalized On-Policy Distillation (G-OPD)
- G-OPD is a unified framework combining knowledge distillation with on-policy reinforcement learning via KL-constrained optimization.
- It introduces a flexible reference model and a reward scaling factor to interpolate between imitation learning and reward maximization.
- Empirical results show that, with proper tuning, reward extrapolation enables student models to outperform their teachers across diverse tasks.
Generalized On-Policy Distillation (G-OPD) is a theoretical and algorithmic framework that unifies knowledge distillation and reinforcement learning (RL) under KL-constrained optimization. G-OPD extends standard on-policy distillation (OPD) by introducing a flexible reference model and a reward scaling parameter, allowing distillation to interpolate between imitation and reward maximization and, in certain regimes, enabling student models to surpass their teachers' performance. It provides a mathematical and empirical foundation for advanced distillation procedures, including black-box adversarial variants and reward extrapolation, in both white-box and black-box settings (Yang et al., 12 Feb 2026; Ye et al., 13 Nov 2025).
1. Theoretical Foundations and Objective
OPD is formalized as a dense, token-level KL-constrained RL problem. For an autoregressive student policy $\pi_\theta$, a fixed teacher policy $\pi_T$, and an arbitrary reference model $\pi_{\text{ref}}$, the canonical KL-regularized RL objective is

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right) \right],$$

where $\mathcal{D}$ is the prompt distribution, $r(x, y)$ is a scalar trajectory reward, and $\beta > 0$ controls the strength of regularization.
Standard OPD corresponds to the reward $r(x, y) = \beta \sum_t r_t$ with a fixed reference $\pi_{\text{ref}}$, where $r_t$ provides dense token-level reward signals:

$$r_t = \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}.$$
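Substituting this reward into the KL-regularized objective collapses it to a reverse-KL imitation loss, which is the standard OPD objective; the check is one line:

```latex
\mathbb{E}_{y \sim \pi_\theta}\!\left[\beta \log \frac{\pi_T(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right]
- \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
= \beta\, \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \frac{\pi_T(y \mid x)}{\pi_\theta(y \mid x)}\right]
= -\beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right).
```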
The generalized G-OPD objective introduces a reward scaling factor $\alpha > 0$:

$$J_{\text{G-OPD}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ \alpha \beta \sum_t r_t \right] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right) \right],$$

whose optimum is $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x) \left( \pi_T(y \mid x) / \pi_{\text{ref}}(y \mid x) \right)^{\alpha}$. Varying $\alpha$ allows interpolation ($0 < \alpha < 1$) between reference and teacher or extrapolation ($\alpha > 1$) beyond the teacher (Yang et al., 12 Feb 2026).
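The closed-form optimum $\pi^* \propto \pi_{\text{ref}} (\pi_T / \pi_{\text{ref}})^{\alpha}$ can be checked on a toy categorical vocabulary; the distributions and function name below are illustrative, not taken from the papers:

```python
def gopd_optimal_policy(p_ref, p_teacher, alpha):
    """Closed-form G-OPD optimum over a toy categorical vocabulary:
    pi*(y) proportional to p_ref(y) * (p_teacher(y) / p_ref(y))**alpha.
    alpha=1 recovers the teacher; 0<alpha<1 interpolates toward the
    reference; alpha>1 extrapolates beyond the teacher."""
    unnorm = [r * (t / r) ** alpha for r, t in zip(p_ref, p_teacher)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

p_ref = [0.5, 0.3, 0.2]      # hypothetical reference distribution
p_teacher = [0.7, 0.2, 0.1]  # hypothetical teacher distribution

pi_imitate = gopd_optimal_policy(p_ref, p_teacher, alpha=1.0)  # equals teacher
pi_interp = gopd_optimal_policy(p_ref, p_teacher, alpha=0.5)   # between the two
pi_extrap = gopd_optimal_policy(p_ref, p_teacher, alpha=2.0)   # beyond teacher
```

At $\alpha = 2$ the mass on the teacher's preferred token exceeds the teacher's own probability, which is exactly the extrapolation regime exploited by ExOPD below.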
2. Instantiations: White-Box, Black-Box, and Adversarial G-OPD
The white-box regime requires access to teacher model logits to compute the token-level reward $r_t$. In contrast, black-box G-OPD, operationalized as Generative Adversarial Distillation (GAD), dispenses with teacher logits and instead employs a discriminator $D_\phi$ trained to distinguish (prompt, response) pairs drawn from the teacher versus the student. The student receives the discriminator's output as an on-policy reward and is trained via RL algorithms (REINFORCE, GRPO) to maximize the expected log-reward:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ \log D_\phi(x, y) \right].$$
The discriminator employs a pairwise Bradley–Terry loss and co-evolves online with the student (Ye et al., 13 Nov 2025).
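As a sketch (the exact parameterization in Ye et al. may differ), a pairwise Bradley–Terry loss over scalar discriminator scores for a teacher response and a student response looks like:

```python
import math

def bradley_terry_loss(score_teacher, score_student):
    """Pairwise Bradley-Terry discriminator loss (illustrative sketch):
    trains the discriminator to rank the teacher's response above the
    student's, L = -log sigmoid(s_teacher - s_student)."""
    margin = score_teacher - score_student
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the discriminator's margin in favor of the teacher grows, and equals $\log 2$ when the two responses are scored identically.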
A summary comparison:
| Setting | Teacher Access | Reward Signal |
|---|---|---|
| White-box OPD | Teacher logits | Token-level log-ratio $\log \pi_T / \pi_{\text{ref}}$ |
| Black-box G-OPD/GAD | Text samples only | Discriminator score $\log D_\phi(x, y)$ |
3. Reward Extrapolation (ExOPD) and Surpassing the Teacher
Reward extrapolation sets $\alpha > 1$ in the G-OPD objective, overweighting the teacher's implicit reward. Empirical results in both single-teacher and multi-teacher (domain-expert merging) configurations demonstrate that ExOPD not only achieves parity with the teacher but can exceed the teacher's performance boundary, particularly where the teacher itself was optimized from a base model via RL. For example, with $\alpha > 1$, ExOPD produced +2.0 points (math) and +0.9 points (code) over OPD, surpassing the aggregation of RL-based domain teachers (Yang et al., 12 Feb 2026):
Remark: when $\alpha > 1$, the optimal policy $\pi^* \propto \pi_{\text{ref}}\,(\pi_T / \pi_{\text{ref}})^{\alpha}$ overweights the teacher's implicit reward, so $\pi^*$ lies beyond the teacher rather than at it.
However, excessively large $\alpha$ can cause reward hacking and instability.
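One way to see the instability risk is through the closed-form optimum: as $\alpha$ grows, $\pi^*$ concentrates mass on the teacher's preferred tokens and its entropy collapses, leaving little exploration. A toy illustration (all distributions hypothetical):

```python
import math

def extrapolated_policy(p_ref, p_teacher, alpha):
    """pi* proportional to p_ref * (p_teacher / p_ref)**alpha, on a toy vocabulary."""
    unnorm = [r * (t / r) ** alpha for r, t in zip(p_ref, p_teacher)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

p_ref = [0.4, 0.3, 0.2, 0.1]
p_teacher = [0.6, 0.25, 0.1, 0.05]

# Entropy shrinks monotonically as alpha grows: large alpha pushes nearly
# all mass onto the teacher's mode, one mechanism behind the instability
# noted above.
ents = [entropy(extrapolated_policy(p_ref, p_teacher, a)) for a in (1, 2, 4, 8)]
```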
4. Reference Model Choices and Reward Correction
G-OPD is parameterized by a flexible reference model. In the multi-teacher/same-size setting, the reference is the common base $\pi_{\text{base}}$. In strong-to-weak distillation (distilling a large teacher into a smaller student), possible choices are the student's base ($\pi_{S\text{-base}}$) or, where available, the teacher's pre-RL base ($\pi_{T\text{-base}}$). The latter enables "reward correction":

$$r_t = \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{T\text{-base}}(y_t \mid x, y_{<t})}.$$
Reward correction more faithfully reflects the RL training signal but incurs increased computational cost since access to the pre-trained teacher base and extra forward passes are required. Empirically, this choice provides an additional +1–2 point performance improvement (Yang et al., 12 Feb 2026).
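In code, reward correction only changes which log-probabilities enter the per-token reward; a minimal sketch, with function and argument names assumed for illustration:

```python
def corrected_token_rewards(logp_teacher, logp_teacher_base):
    """Reward-correction sketch: per-token implicit reward measured against
    the teacher's own pre-RL base checkpoint rather than the student's base,
    r_t = log pi_T(y_t | x, y_<t) - log pi_T_base(y_t | x, y_<t).
    Inputs are per-token log-probabilities of the student's sampled rollout
    scored by the teacher and by the teacher's pre-RL base."""
    return [lt - lb for lt, lb in zip(logp_teacher, logp_teacher_base)]
```

The extra cost noted above comes from the second scoring pass: every rollout token must be evaluated under both the teacher and its pre-RL base.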
5. Algorithmic Implementation and Stabilization
G-OPD admits a token-wise policy gradient estimator:

$$\nabla_\theta J \approx \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\, A_t \right],$$

where the advantage $A_t$ is built from the per-token reward term $r_t = \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}$. In black-box G-OPD (GAD), the alternating update comprises:
- Warmup phase: Supervised cross-entropy (MLE) on teacher outputs and initial discriminator updates for stabilization.
- Alternating co-training: Mini-batch sampling, student rollout, discriminator scoring and update, student policy update via policy gradient.
Robustness techniques include PPO/GRPO-based policy update clipping, multiple discriminator steps per generator update, and monitoring for reward gaming (e.g., excessive response length) (Ye et al., 13 Nov 2025).
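The alternating co-training phase can be sketched as a loop; all model calls below are toy stand-ins passed in as callables (a real run would plug in the LLM student, the black-box teacher API, and a learned discriminator):

```python
import random

def gad_training_loop(prompts, teacher_sample, student_sample,
                      disc_score, disc_update, policy_update,
                      steps=100, disc_steps_per_gen=2):
    """Skeleton of GAD alternating co-training (illustrative, not the
    papers' actual stack). Each iteration: sample a prompt, draw one
    black-box teacher sample and one on-policy student rollout, update
    the discriminator (possibly several times per generator step, a
    stabilization trick), then take a REINFORCE-style policy step on
    the student using the discriminator's score as reward."""
    for _ in range(steps):
        x = random.choice(prompts)
        y_teacher = teacher_sample(x)          # black-box text sample
        y_student = student_sample(x)          # on-policy rollout
        for _ in range(disc_steps_per_gen):    # multiple D steps per G step
            disc_update(x, y_teacher, y_student)
        reward = disc_score(x, y_student)      # discriminator output
        policy_update(x, y_student, reward)    # policy gradient step
```

The `disc_steps_per_gen` knob corresponds to the "multiple discriminator steps per generator update" stabilization above; the warmup phase would run MLE on teacher outputs before entering this loop.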
6. Empirical Results and Analysis
G-OPD and its variants—including OPD, ExOPD, and GAD—were evaluated on datasets including LMSYS-Chat-1M-Clean, AIME (math), HMMT, and HumanEval+ (code). Empirical findings include:
- On LMSYS and out-of-distribution datasets, GAD outperforms supervised sequence-level knowledge distillation (SeqKD), with +1.7 points on OOD splits (Ye et al., 13 Nov 2025).
- ExOPD ($\alpha > 1$) yields consistent improvements over OPD across math and code, and achieves super-teacher performance in multi-domain and strong-to-weak settings, gaining +2.7 points over OPD in 30B→4B/1.7B distillation (Yang et al., 12 Feb 2026).
- Ablations reveal that removing warmup or switching to a fixed, off-policy discriminator causes performance collapse or reward hacking.
7. Limitations, Applications, and Extensions
The principal limitations of G-OPD are computational. Reward correction with large reference models induces a double-forward cost per token, and accessing a teacher's pre-RL checkpoint may be operationally prohibitive. The selection of $\alpha$ is critical: excessive values can destabilize training. Dense implicit rewards, central to OPD/G-OPD, may not generalize to highly stochastic or open-ended tasks.
Applications include domain expert merging, strong-to-weak and multi-teacher distillation, and black-box API distillation. G-OPD readily accommodates multimodal extensions (image-text, etc.), per-token reward shaping, and hierarchical abstraction stacking. In black-box regimes, the framework subsumes adversarial methods (e.g., GAD), further broadening its practical impact (Ye et al., 13 Nov 2025).
G-OPD provides a unified, theoretically grounded, and practically validated approach to on-policy distillation, encapsulating both RL and distillation settings, while introducing flexibility in reward weighting and reference allocation that can drive student models to, or beyond, teacher-level performance (Yang et al., 12 Feb 2026; Ye et al., 13 Nov 2025).