Generalized On-Policy Distillation (G-OPD)

Updated 15 February 2026

G-OPD is a unified framework combining knowledge distillation with on-policy reinforcement learning via KL-constrained optimization.
It introduces a flexible reference model and a reward scaling factor to interpolate between imitation learning and reward maximization.
Empirical results show that, with proper tuning, reward extrapolation enables student models to outperform their teachers across diverse tasks.

Generalized On-Policy Distillation (G-OPD) is a theoretical and algorithmic framework that unifies knowledge distillation and reinforcement learning (RL) under KL-constrained optimization. G-OPD extends standard on-policy distillation (OPD) with a flexible reference model and a reward scaling parameter, so that distillation can interpolate between imitation and reward maximization and, in certain regimes, let student models surpass their teachers. It provides a mathematical and empirical foundation for advanced distillation procedures, including black-box adversarial variants and reward extrapolation, in both white-box and black-box settings (Yang et al., 12 Feb 2026, Ye et al., 13 Nov 2025).

1. Theoretical Foundations and Objective

OPD is formalized as a dense, token-level KL-constrained RL problem. For an autoregressive student policy $\pi_n(\cdot|x; \theta)$ parameterized by $\theta$, a fixed teacher policy $\pi^*(\cdot|x)$, and an arbitrary reference model $\pi_{\rm ref}(\cdot|x)$, the canonical KL-regularized RL objective is

$J_{\rm RL}(\theta) = \max_\theta \mathbb{E}_{x\sim D,\, y \sim \pi_n(\cdot|x)} \left[ r(x,y) - \beta D_{\rm KL}\big(\pi_n(\cdot|x) \| \pi_{\rm ref}(\cdot|x)\big)\right]$

where $D$ is the prompt distribution, $r(x,y)$ is a scalar trajectory-level reward, and $\beta$ controls the strength of the KL regularization.

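Standard OPD corresponds to the reward $r(x,y) = \log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)}$ with $\beta=1$ and a fixed reference $\pi_{\rm ref}$:

$J_{\rm OPD}(\theta) = \max_\theta \mathbb{E}_{x\sim D,\, y \sim \pi_n(\cdot|x)} \left[ \log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)} - D_{\rm KL}\big(\pi_n(\cdot|x) \| \pi_{\rm ref}(\cdot|x)\big)\right]$

where the log-ratio $\log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)}$ provides dense token-level reward signals:

$\log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)} = \sum_{t=1}^{|y|} \log\frac{\pi^*(y_t|x, y_{<t})}{\pi_{\rm ref}(y_t|x, y_{<t})}$

The generalized G-OPD objective introduces a reward scaling factor $\lambda$:

$J_{\rm G\text{-}OPD}(\theta) = \max_\theta \mathbb{E}_{x\sim D,\, y \sim \pi_n(\cdot|x)} \left[ \lambda \log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)} - D_{\rm KL}\big(\pi_n(\cdot|x) \| \pi_{\rm ref}(\cdot|x)\big)\right]$

Varying $\lambda$ allows interpolation ($0 < \lambda < 1$) between reference and teacher or extrapolation ($\lambda > 1$) beyond the teacher; $\lambda = 1$ recovers standard OPD (Yang et al., 12 Feb 2026).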
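As a quick sanity check on the objective, the following minimal Python sketch evaluates the KL-regularized objective exactly for a single-step toy problem with three possible outputs. The toy distributions and the names `gopd_objective` and `lam` (standing for the reward scaling factor) are illustrative assumptions, not taken from the cited papers.

```python
import math

def kl(p, q):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gopd_objective(student, teacher, ref, lam=1.0):
    """Exact value of E_{y~student}[lam * log(teacher/ref)] - KL(student || ref)."""
    reward = sum(s * lam * math.log(t / r) for s, t, r in zip(student, teacher, ref))
    return reward - kl(student, ref)

# Hypothetical toy distributions over three outputs (assumptions for illustration).
student = [0.4, 0.4, 0.2]
teacher = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]

j_opd = gopd_objective(student, teacher, ref, lam=1.0)   # scaling factor 1
j_ext = gopd_objective(student, teacher, ref, lam=1.5)   # scaling factor above 1
print(j_opd, -kl(student, teacher))
print(j_ext)
```

With a scaling factor of 1 the reference cancels and the objective equals the negative reverse KL between student and teacher, so the two printed values on the first line agree up to rounding. For other scaling factors the reference no longer cancels, which is why its choice matters in the generalized setting.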

2. Instantiations: White-Box, Black-Box, and Adversarial G-OPD

The white-box regime requires access to teacher model logits to compute the token-level log-ratio reward $\log\frac{\pi^*(y_t|x, y_{<t})}{\pi_{\rm ref}(y_t|x, y_{<t})}$. In contrast, black-box G-OPD, operationalized as Generative Adversarial Distillation (GAD), dispenses with teacher logits and instead employs a discriminator $D_\phi$ trained to distinguish $(x, y)$ pairs from teacher and student. The student receives the discriminator's output $D_\phi(x, y)$ as on-policy reward and is trained via RL algorithms (REINFORCE, GRPO) to maximize expected log-reward:

$\max_\theta \mathbb{E}_{x\sim D,\, y \sim \pi_n(\cdot|x)} \left[ \log D_\phi(x, y) \right]$

The discriminator employs a pairwise Bradley–Terry loss and co-evolves online with the student (Ye et al., 13 Nov 2025).
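The pairwise Bradley–Terry loss mentioned above can be sketched in a few lines of Python. This is a generic form of the loss, assuming the discriminator emits one scalar score per (prompt, response) pair. The function name and the numerically stable formulation are illustrative choices rather than details from the cited paper.

```python
import math

def bradley_terry_loss(score_teacher, score_student):
    """Pairwise loss -log sigmoid(score_teacher - score_student).

    Written as log(1 + exp(-diff)) in a form that avoids overflow
    when |diff| is large.
    """
    diff = score_teacher - score_student
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss shrinks as the teacher response is scored above the student response.
for s_teacher, s_student in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
    print(s_teacher, s_student, bradley_terry_loss(s_teacher, s_student))
```

Minimizing this loss pushes teacher responses to score higher than student responses for the same prompt, which is what lets the score act as a reward for the student.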

A summary comparison:

Setting	Teacher Access	Reward Signal
White-box OPD	Teacher logits	Token-level log-ratio $\log\frac{\pi^*(y_t \mid x, y_{<t})}{\pi_{\rm ref}(y_t \mid x, y_{<t})}$
Black-box G-OPD/GAD	Text samples only	Discriminator score $D_\phi(x, y)$

3. Reward Extrapolation (ExOPD) and Surpassing the Teacher

Reward extrapolation sets the reward scaling factor $\lambda > 1$ in the G-OPD objective, overweighting the teacher's implicit reward. Empirical results in both single-teacher and multi-teacher (domain-expert merging) configurations show that ExOPD can not only match the teacher but also exceed its performance, particularly where the teacher itself was optimized from a base model via RL. For example, ExOPD gained +2.0 points (math) and +0.9 points (code) over OPD, surpassing the aggregation of RL-based domain teachers (Yang et al., 12 Feb 2026). The effect is visible in the maximizer of the KL-regularized objective with reward $\lambda \log\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)}$ and $\beta = 1$:

$\pi_n^{\rm opt}(y|x) \propto \pi_{\rm ref}(y|x) \left(\frac{\pi^*(y|x)}{\pi_{\rm ref}(y|x)}\right)^{\lambda}$

which recovers the teacher at $\lambda = 1$ and moves past it, further from the reference, for $\lambda > 1$.

However, excessively large $\lambda$ can cause reward hacking and instability.
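The extrapolation effect can be seen numerically on a toy categorical example. The sketch below normalizes $\pi_{\rm ref}^{1-\lambda} (\pi^*)^{\lambda}$, which is the standard maximizer of a KL-regularized objective with a scaled log-ratio reward and unit KL weight. The distributions, the function name `gopd_optimum`, and the chosen scaling values are hypothetical and only meant to illustrate the trend.

```python
def gopd_optimum(teacher, ref, lam):
    """Normalize ref**(1 - lam) * teacher**lam over a finite output set."""
    weights = [r ** (1.0 - lam) * t ** lam for t, r in zip(teacher, ref)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical distributions: the teacher favors output 0 more than the reference does.
teacher = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]

for lam in (0.0, 0.5, 1.0, 1.5):
    probs = gopd_optimum(teacher, ref, lam)
    print(lam, [round(p, 3) for p in probs])
```

A scaling factor of 0 returns the reference and 1 returns the teacher. A value above 1 puts even more mass on the output the teacher prefers relative to the reference. The same mechanism explains the instability risk, since very large values concentrate nearly all mass on a single output.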

4. Reference Model Choices and Reward Correction

G-OPD is parameterized by a flexible reference model. In the multi-teacher/same-size setting, the reference is the common base $\pi_{\rm base}$. In strong-to-weak distillation (distilling a large teacher $\pi^*$ to a smaller student), possible choices are the student's base ($\pi_{\rm ref} = \pi_{\rm base}^{\rm student}$) or, where available, the teacher's pre-RL base ($\pi_{\rm ref} = \pi_{\rm base}^{\rm teacher}$). The latter enables "reward correction":

$r(x,y) = \log\frac{\pi^*(y|x)}{\pi_{\rm base}^{\rm teacher}(y|x)}$

Reward correction more faithfully reflects the RL training signal but incurs increased computational cost since access to the pre-trained teacher base and extra forward passes are required. Empirically, this choice provides an additional +1–2 point performance improvement (Yang et al., 12 Feb 2026).
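To make the two reference choices concrete, here is a minimal sketch that computes the per-token implicit reward once against the student's base and once against the teacher's pre-RL base. All log-probabilities are made-up numbers, and `implicit_reward` and `lam` are hypothetical names. In practice each list would come from a separate forward pass of the corresponding model over the student's sampled response.

```python
def implicit_reward(teacher_logps, ref_logps, lam=1.0):
    """Per-token scaled log-ratio lam * (log pi_teacher - log pi_ref)."""
    return [lam * (t - r) for t, r in zip(teacher_logps, ref_logps)]

# Hypothetical per-token log-probabilities for one sampled response.
teacher_logps = [-0.2, -0.9, -0.4]
student_base_logps = [-1.1, -1.3, -0.8]   # reference = student's base
teacher_base_logps = [-0.6, -1.5, -0.5]   # reference = teacher's pre-RL base (extra forward pass)

lam = 1.5  # illustrative value only
r_plain = implicit_reward(teacher_logps, student_base_logps, lam)
r_corrected = implicit_reward(teacher_logps, teacher_base_logps, lam)
print(r_plain)
print(r_corrected)
```

The two rewards differ per token by the scaled gap between the two base models' log-probabilities, so the correction matters most where the student's base and the teacher's base disagree.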

5. Algorithmic Implementation and Stabilization

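G-OPD admits a token-wise policy gradient estimator with advantage:

$A_t = \lambda\, r_t - \log\frac{\pi_n(y_t|x, y_{<t})}{\pi_{\rm ref}(y_t|x, y_{<t})}, \qquad r_t = \log\frac{\pi^*(y_t|x, y_{<t})}{\pi_{\rm ref}(y_t|x, y_{<t})}$

where $r_t$ denotes the per-token reward term and $\lambda$ is the reward scaling factor. In black-box G-OPD (GAD), the alternating update procedure comprises: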

Warmup phase: Supervised cross-entropy (MLE) on teacher outputs and initial discriminator updates for stabilization.
Alternating co-training: Mini-batch sampling, student rollout, discriminator scoring and update, student policy update via policy gradient.
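The two phases above can be sketched as a toy loop. This is a deliberately simplified illustration, not the authors' implementation. It assumes a single prompt with a small finite set of candidate responses, a softmax student over those responses, and a discriminator that is just a table with one score per response. It uses the raw score as the REINFORCE reward with a mean baseline, and all hyperparameters are arbitrary.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def discriminator_step(scores, y_teacher, y_student, lr):
    """One gradient step on the Bradley-Terry loss -log sigmoid(D[y_t] - D[y_s])."""
    g = 1.0 - sigmoid(scores[y_teacher] - scores[y_student])
    scores[y_teacher] += lr * g
    scores[y_student] -= lr * g

def train_gad_toy(teacher_probs, steps=200, warmup=50, d_steps=2,
                  lr_student=0.1, lr_disc=0.1, seed=0):
    rng = random.Random(seed)
    k = len(teacher_probs)
    logits = [0.0] * k   # student policy parameters
    scores = [0.0] * k   # toy discriminator: one scalar score per response

    # Warmup: MLE on teacher samples plus initial discriminator updates.
    for _ in range(warmup):
        probs = softmax(logits)
        y_t = sample(teacher_probs, rng)
        y_s = sample(probs, rng)
        for i in range(k):
            # Gradient of log p(y_t) w.r.t. the logits is onehot(y_t) - probs.
            logits[i] += lr_student * ((i == y_t) - probs[i])
        discriminator_step(scores, y_t, y_s, lr_disc)

    # Alternating co-training: several discriminator steps per student step.
    for _ in range(steps):
        probs = softmax(logits)
        for _ in range(d_steps):
            discriminator_step(scores, sample(teacher_probs, rng),
                               sample(probs, rng), lr_disc)
        y_s = sample(probs, rng)                       # on-policy rollout
        baseline = sum(p * s for p, s in zip(probs, scores))
        advantage = scores[y_s] - baseline             # discriminator score as reward
        for i in range(k):
            logits[i] += lr_student * advantage * ((i == y_s) - probs[i])
    return softmax(logits), scores

student_probs, disc_scores = train_gad_toy([0.7, 0.2, 0.1])
print([round(p, 3) for p in student_probs])
```

Because the discriminator keeps training on fresh student samples, its scores track where the current student differs from the teacher. That is the on-policy property the real method relies on.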

Robustness techniques include PPO/GRPO-based policy update clipping, multiple discriminator steps per generator update, and monitoring for reward gaming (e.g., excessive response length) (Ye et al., 13 Nov 2025).
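The token-wise advantage and the clipping idea from this section can be combined in a short sketch. It assumes the per-token advantage takes the form of a scaled teacher-versus-reference log-ratio minus the student-versus-reference log-ratio. The clipping function is the standard PPO clipped surrogate. Function names, `lam`, and `eps` are illustrative assumptions.

```python
import math

def gopd_advantages(student_logps, teacher_logps, ref_logps, lam=1.0):
    """Per-token advantage lam*(log teacher - log ref) - (log student - log ref)."""
    return [lam * (t - r) - (s - r)
            for s, t, r in zip(student_logps, teacher_logps, ref_logps)]

def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical per-token log-probabilities of one sampled response.
student_logps = [-0.9, -1.2, -0.3]
teacher_logps = [-0.2, -0.9, -0.4]
ref_logps = [-1.1, -1.3, -0.8]

advantages = gopd_advantages(student_logps, teacher_logps, ref_logps, lam=1.5)
# On the first update the new and old policies coincide, so the ratio is 1.
objective = sum(clipped_surrogate(s, s, a) for s, a in zip(student_logps, advantages))
print(advantages, objective)
```

With a scaling factor of 1 the reference terms cancel and the advantage reduces to the teacher log-probability minus the student log-probability. Clipping then bounds how far a single update can move the policy on tokens with large advantages.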

6. Empirical Results and Analysis

G-OPD and its variants—including OPD, ExOPD, and GAD—were evaluated on datasets including LMSYS-Chat-1M-Clean, AIME (math), HMMT, and HumanEval+ (code). Empirical findings include:

On LMSYS and out-of-distribution datasets, GAD outperforms supervised sequence-level knowledge distillation (SeqKD), with +1.7 points on OOD splits (Ye et al., 13 Nov 2025).
ExOPD ($\lambda > 1$) yields consistent improvement over OPD across math and code, and achieves super-teacher performance in multi-domain and strong-to-weak settings, with +2.7 points over OPD in 30B $\to$ 4B/1.7B distillation.
Ablations reveal that removing warmup or switching to a fixed, off-policy discriminator causes performance collapse or reward hacking.

7. Limitations, Applications, and Extensions

The principal limitations of G-OPD are computational in nature. Reward correction with large reference models induces double-forward cost per token, and accessing a teacher's pre-RL checkpoint may be operationally prohibitive. The selection of $\lambda$ is critical: excessive values can destabilize training. Dense implicit rewards, central to OPD/G-OPD, may not generalize to highly stochastic or open-ended tasks.

Applications include domain expert merging, strong-to-weak and multi-teacher distillation, and black-box API distillation. G-OPD readily accommodates multimodal extensions (image-text, etc.), per-token reward shaping, and hierarchical abstraction stacking. In black-box regimes, the framework subsumes adversarial methods (e.g., GAD), further broadening its practical impact (Ye et al., 13 Nov 2025).

G-OPD provides a unified, theoretically grounded, and empirically validated approach to on-policy distillation that covers both RL and distillation settings. Its flexibility in reward weighting and reference choice can drive student models to or beyond teacher-level performance (Yang et al., 12 Feb 2026, Ye et al., 13 Nov 2025).


References

Yang et al. (12 Feb 2026). Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation.

Ye et al. (13 Nov 2025). Black-Box On-Policy Distillation of Large Language Models.
