OPRD: On-Policy Representation Distillation

Updated 4 July 2026

The paper demonstrates that aligning teacher and student hidden representations on on-policy rollouts bypasses the LM head, reducing variance and memory usage.
It shifts supervision from token probabilities to internal latent spaces, addressing exposure bias and enhancing model performance.
Empirical results show that OPRD outperforms traditional logit-based distillation with faster training, improved efficiency, and better alignment on benchmarks.

On-Policy Representation Distillation (OPRD) is a form of on-policy distillation in which the student is trained on trajectories sampled from its own policy, but supervision is moved from next-token probabilities into hidden-state or other structured representation spaces. In the explicit formulation introduced in "OPRD: On-Policy Representation Distillation", the student aligns teacher and student representations across selected layers on the same rollouts, thereby bypassing the LM head entirely; broader surveys place OPRD within the general OPD family as a representation-space extension of interactive imitation learning over student-visited states (Yang et al., 4 Jun 2026, Song et al., 1 Apr 2026, Zhang, 22 Jun 2026).

1. Conceptual scope and relation to on-policy distillation

Within the OPD literature, the defining change from conventional distillation is not primarily the divergence itself but the sampling measure. Instead of training on teacher-generated or dataset-fixed trajectories, the student generates its own trajectory $y = (y_1,\dots,y_T)$, teacher feedback is evaluated on the resulting prefixes $y_{<t}$, and optimization therefore occurs under the student visitation distribution. The survey formulation writes this as

$L_{\mathrm{OPD}}(\theta)=\mathbb{E}_{y \sim \pi_{\mathrm{mix}}}\left[\sum_{t=1}^{|y|}\mathcal{D}_f\big(p_T(\cdot \mid x,y_{<t}),\,p_\theta(\cdot \mid x,y_{<t})\big)\right],$

with $\pi_{\mathrm{mix}}$ equal to or heavily weighted toward the student policy in the on-policy case (Song et al., 1 Apr 2026).
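The objective above can be illustrated with a minimal sketch, assuming reverse KL as the choice of $\mathcal{D}_f$ and a purely on-policy $\pi_{\mathrm{mix}}$. The callables `toy_student` and `toy_teacher` are hypothetical stand-ins that map a prefix to a next-token distribution; they are not part of any cited method.

```python
import math
import random

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sample(probs, rng):
    """Draw one token index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def opd_loss_on_rollout(student, teacher, prompt, max_len, rng):
    """Sum a per-step divergence along a trajectory sampled from the student.

    Assumption: reverse KL, KL(student || teacher), stands in for D_f.
    """
    prefix, total = tuple(prompt), 0.0
    for _ in range(max_len):
        p_s, p_t = student(prefix), teacher(prefix)
        total += kl(p_s, p_t)            # teacher is queried on the student's state
        prefix += (sample(p_s, rng),)    # the next state is chosen by the student
    return total, prefix[len(prompt):]

# Hypothetical toy models over a 3-token vocabulary.
def toy_student(prefix):
    return [0.7, 0.2, 0.1] if len(prefix) % 2 == 0 else [0.2, 0.5, 0.3]

def toy_teacher(prefix):
    return [0.6, 0.3, 0.1] if len(prefix) % 2 == 0 else [0.1, 0.6, 0.3]

loss, rollout = opd_loss_on_rollout(toy_student, toy_teacher, [0], 4, random.Random(0))
```

The point of the sketch is the loop structure: the prefix on which the teacher is evaluated is always extended by a student sample, so the expectation is taken under the student visitation distribution rather than a fixed dataset.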

This shift is motivated by exposure bias. Off-policy distillation trains on the dataset-induced state distribution $d_{\mathcal D}$, whereas inference unfolds under the student-induced distribution $d_{\pi_\theta}$. The survey explicitly connects OPD to interactive imitation learning, formalizing autoregressive generation as an MDP with state $s_t=(x,y_{<t})$ and action $a_t=y_t$, and noting the DAgger-style distinction between $O(\epsilon T^2)$ error growth for behavior cloning and $O(\epsilon T)$ for interactive expert queries (Song et al., 1 Apr 2026). OPRD preserves this on-policy logic while changing the object of transfer: rather than matching only the teacher next-token distribution $p_T(\cdot \mid x,y_{<t})$, it aligns teacher and student internal states or latent distributions on those same student-generated states.

In that sense, OPRD is not a departure from OPD but an extension of it. The formula-driven survey makes this explicit by treating OPRD as a “representation-support” extension in which feedback moves from vocabulary space into hidden-state space while the state source remains on-policy student rollouts (Zhang, 22 Jun 2026).

2. Formal objectives and theoretical rationale

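A general latent-space view of OPRD introduces teacher and student representations $h_{T,t}$ and $h_{\theta,t}$, optionally projected into a common space as $z_{T,t}=\phi_T(h_{T,t})$ and $z_{\theta,t}=\phi_\theta(h_{\theta,t})$. The survey then writes a natural OPRD objective as

$L_{\mathrm{OPRD}}(\theta)=\mathbb{E}_{y \sim \pi_{\mathrm{mix}}}\left[\sum_{t=1}^{|y|}\mathcal{D}_f\big(q_T(\cdot \mid z_{T,t}),\,q_\theta(\cdot \mid z_{\theta,t})\big)\right],$

where $q_T$ and $q_\theta$ are latent distributions induced by those representations (Song et al., 1 Apr 2026). In practice, this may reduce to MSE, cosine, or other direct representation losses.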

The dedicated OPRD formulation instantiates this idea as layer- and position-selected hidden-state matching on student rollouts. If $y \sim \pi_\theta(\cdot \mid x)$, selected layers are $\mathcal{S}$, supervised positions are marked by $m_t \in \{0,1\}$, and hidden size is $d$, the loss is

$L_{\mathrm{OPRD}}(\theta)=\mathbb{E}_{y \sim \pi_\theta}\left[\frac{1}{|\mathcal{S}|\,d\,\sum_{t} m_t}\sum_{\ell \in \mathcal{S}}\sum_{t=1}^{|y|} m_t\,\big\lVert h^{(\ell)}_{\theta,t}-h^{(\ell)}_{T,t}\big\rVert_2^2\right],$

where $h^{(\ell)}_{\theta,t}$ and $h^{(\ell)}_{T,t}$ are the student and teacher hidden states at layer $\ell$ and position $t$, and the teacher states are treated as fixed targets.

The paper’s central claim is that this lifts distillation into hidden-state space on the same rollouts, bypasses the LM head entirely, eliminates sampling variance, and provides richer per-layer structural information (Yang et al., 4 Jun 2026).
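Layer- and position-selected hidden-state matching can be sketched as a masked mean-squared error. This is a minimal illustration under stated assumptions: hidden states are plain nested lists indexed `[layer][position][dim]`, student and teacher share the same hidden size, and the helper names `oprd_loss` and `tail_mask` are hypothetical rather than taken from the paper.

```python
def oprd_loss(student_h, teacher_h, layers, mask):
    """Masked MSE between student and teacher hidden states on one rollout.

    student_h, teacher_h: nested lists indexed [layer][position][hidden_dim],
    both computed on the same student-generated tokens. Teacher states act
    as constants (in a real framework no gradient would flow into them).
    layers: indices of supervised layers; mask: one 0/1 flag per position.
    """
    total, count = 0.0, 0
    for layer in layers:
        for t, keep in enumerate(mask):
            if not keep:
                continue  # unsupervised position contributes nothing
            for hs, ht in zip(student_h[layer][t], teacher_h[layer][t]):
                total += (hs - ht) ** 2
                count += 1
    return total / count if count else 0.0

def tail_mask(seq_len, k):
    """Supervise only the last k positions of a response."""
    return [1 if t >= seq_len - k else 0 for t in range(seq_len)]

# Toy example: 2 layers, 3 positions, hidden size 2.
teacher_h = [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
             [[0.0, 0.0], [1.0, 1.0], [3.0, 2.0]]]
student_h = [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
             [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]]
loss = oprd_loss(student_h, teacher_h, layers=[0, 1], mask=tail_mask(3, 2))
```

No vocabulary-sized tensor appears anywhere in the computation, and no token is sampled inside the loss, which is the sense in which the loss is deterministic once the rollout is fixed.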

The theoretical contrast with sampled-token OPD is sharp. The OPRD paper shows that sampled-token OPD is a REINFORCE-style estimator whose signal-to-noise ratio collapses as the student approaches the teacher, whereas the OPRD gradient is deterministic given the rollout and therefore has zero conditional variance. It also identifies an LM-head bottleneck: output-space distillation is blind to hidden-state perturbations in the effective null space of the LM head, and low-singular-value directions in an ill-conditioned head can hide large representation differences behind small output losses (Yang et al., 4 Jun 2026). This is the strongest formal argument in the literature for treating internal representations, rather than only logits, as first-class distillation targets.
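The LM-head bottleneck can be made concrete with a toy example. The sketch below assumes a hypothetical 2-token, 3-dimensional LM head `W` chosen only so that its null space is easy to see; it shows a hidden-state shift that leaves the output distribution unchanged while the hidden-state distance is large.

```python
import math

def matvec(W, h):
    """Apply the LM head: logits = W @ h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical LM head: vocabulary of 2 tokens, hidden size 3.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
null_dir = [-1.0, -1.0, 1.0]   # W @ null_dir == [0, 0]

teacher_h = [0.5, -0.25, 0.25]
# Shift the student state along the null direction of the head.
student_h = [t + 2.0 * n for t, n in zip(teacher_h, null_dir)]

output_gap = kl(softmax(matvec(W, teacher_h)), softmax(matvec(W, student_h)))
hidden_gap = mse(student_h, teacher_h)
```

Here `output_gap` is zero while `hidden_gap` is 4.0, so a logit-based loss provides no gradient along this direction even though the two hidden states differ substantially. A representation loss sees the difference directly.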

3. Output-space OPD pathologies that motivate representation-aware methods

Several contemporaneous OPD papers diagnose structural problems that do not by themselves define OPRD but strongly motivate it. "Less is More: Early Stopping Rollout for On-Policy Distillation" identifies Off-Policy Teacher Decay: as student-generated prefixes become increasingly off-policy for the teacher, the teacher’s corrective ability decays and later-token supervision reverts toward token-completion behavior. ESR addresses this by truncating rollouts to an early portion of the response and reports that it surpasses full-rollout OPD across model size, family, tasks, and training regime, while delivering much higher GPU efficiency and training stability (Ziheng et al., 26 May 2026).

"PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation" isolates a different failure mode. In vanilla reverse-KL OPD, the sampled-token reward

$r_t=\log p_T(y_t \mid x,y_{<t})-\log p_\theta(y_t \mid x,y_{<t})$

is unbounded, producing high-variance gradients concentrated at early positions. PowerOPD replaces this with a bounded, sign-consistent Box–Cox power reward

$r_t^{(\lambda)}=\frac{p_T(y_t \mid x,y_{<t})^{\lambda}-p_\theta(y_t \mid x,y_{<t})^{\lambda}}{\lambda},\quad \lambda>0$

and reports substantially smaller gradient norms than vanilla OPD, together with benchmark-averaged Avg@8/Pass@8 gains over both vanilla and post-hoc stabilized OPD (Zhao et al., 15 Jun 2026).
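The contrast between an unbounded log-ratio reward and a bounded power reward can be sketched numerically. The form of `power_reward` below is an assumption: it takes the difference of Box–Cox transforms $(p^{\lambda}-1)/\lambda$ of the teacher and student probabilities, which is bounded and sign-consistent with the log-ratio, but it is not guaranteed to match the exact definition in the PowerOPD paper.

```python
import math

def log_ratio_reward(p_teacher, p_student):
    """Sampled-token reward of vanilla reverse-KL OPD: log p_T - log p_theta."""
    return math.log(p_teacher) - math.log(p_student)

def power_reward(p_teacher, p_student, lam=0.5):
    """Assumed Box-Cox style reward: difference of (p**lam - 1) / lam terms.

    For probabilities in [0, 1] the value lies in [-1/lam, 1/lam], and its
    sign always matches the sign of the log-ratio reward.
    """
    return (p_teacher ** lam - p_student ** lam) / lam

# A token the student likes but the teacher considers almost impossible.
vanilla = log_ratio_reward(1e-12, 0.9)   # very large negative value
bounded = power_reward(1e-12, 0.9)       # stays within [-2, 2] for lam = 0.5
```

As `lam` approaches zero the assumed power reward approaches the log-ratio reward, so `lam` trades fidelity to reverse KL against the size of the worst-case reward.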

"Trajectory-Refined Distillation" identifies prefix failure: once a student prefix contains reasoning errors that cannot be extended into a correct solution without retraction, dense per-token supervision produces a bimodal teacher mixture and fragmented gradients. TRD therefore shifts the intervention from token weighting to trajectory-level correction, first revising the student rollout under teacher guidance and then distilling on the refined trajectory (Jiang et al., 7 Jun 2026).

Taken together, these results suggest that OPRD emerged in a broader environment where output-only OPD was already being pressed by variance, teacher-state incompatibility, and trajectory pathology. A plausible implication is that representation-space distillation, trajectory refinement, and competence-aware token selection are complementary responses to a common structural problem rather than competing explanations.

4. Methodological families and representative instantiations

The most direct family is explicit hidden-state matching. The OPRD paper aligns teacher and student representations across selected layers and positions on student rollouts, supervises all transformer layers by default, and, in its main configuration, focuses on the final tokens of the response because cosine similarity analysis shows larger student–teacher divergence in the tail than in the early portion of the response (Yang et al., 4 Jun 2026).

A second family is implicit multimodal OPRD. "ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation" keeps the OPD paradigm but changes what is being internalized: the teacher sees visual cues, while the student never receives cue text and must instead construct an internal recovered cue through a sink-token cross-attention module inserted every 5 transformer layers and run only during prefill. The paper explicitly describes this as representation distillation without feature-matching or explicit representation loss, because gradients from the OPD loss shape an internal “cue” representation that lets the student behave as if it had seen the cue (Tian et al., 4 Jun 2026).

A third family is semantic or structured OPRD. "Rubric-based On-policy Distillation" does not use the term OPRD explicitly, but it frames prompt-specific rubrics as structured semantic criteria distilled from teacher–student contrasts and then optimized via GRPO. The paper explicitly states that if these structured criteria are treated as representations of good behavior, ROPD is an instance of on-policy representation distillation in which the student is optimized to satisfy semantic rubrics rather than match teacher logits (Fang et al., 8 May 2026).
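How rubric satisfaction could feed a GRPO-style update can be sketched as follows. The scoring rule (fraction of rubric criteria satisfied) and the helper names are assumptions for illustration; only the group-relative standardization reflects the general GRPO recipe, and none of this is claimed to be the ROPD implementation.

```python
import statistics

def rubric_score(checks):
    """Assumed reward: fraction of rubric criteria judged satisfied."""
    return sum(checks) / len(checks)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four student rollouts for one prompt, each judged against four criteria.
judgements = [
    [True, True, False, True],
    [False, False, True, False],
    [True, True, True, True],
    [False, False, False, False],
]
rewards = [rubric_score(c) for c in judgements]
advantages = group_relative_advantages(rewards)
```

Rollouts that satisfy more criteria than their group average receive positive advantages, so the student is pushed toward rubric-satisfying behavior without ever seeing teacher logits.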

A fourth family extends the concept beyond autoregressive LLMs. "DiffusionOPD" lifts OPD to continuous-state Markov processes for diffusion models. There, the student is trained on its own latent trajectories, and per-step supervision is a closed-form KL between Gaussian transition kernels with means $\mu_T$ and $\mu_\theta$; in the deterministic ODE regime this reduces to direct $\ell_2$ matching of transition means (Li et al., 14 May 2026). This is still on-policy distillation, but the “representation” being matched is the transition structure of the denoising process rather than a next-token distribution.
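The closed-form KL between Gaussian transition kernels is easy to check numerically. The sketch below assumes that the student and teacher kernels share the same isotropic standard deviation `sigma`, an assumption made for illustration; in that case the KL collapses to a scaled squared distance between the means.

```python
import math

def gaussian_kl_shared_var(mu_a, mu_b, sigma):
    """KL between N(mu_a, sigma^2 I) and N(mu_b, sigma^2 I).

    With a shared variance the KL is symmetric and equals
    ||mu_a - mu_b||^2 / (2 sigma^2).
    """
    return sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)) / (2.0 * sigma ** 2)

def gaussian_kl_diag(mu_a, sig_a, mu_b, sig_b):
    """General diagonal-covariance KL(N_a || N_b), used here as a cross-check."""
    return sum(
        math.log(sb / sa) + (sa ** 2 + (a - b) ** 2) / (2.0 * sb ** 2) - 0.5
        for a, sa, b, sb in zip(mu_a, sig_a, mu_b, sig_b)
    )

mu_student = [0.0, 1.0]
mu_teacher = [1.0, 3.0]
kl_closed = gaussian_kl_shared_var(mu_student, mu_teacher, sigma=0.5)
kl_general = gaussian_kl_diag(mu_student, [0.5, 0.5], mu_teacher, [0.5, 0.5])
```

Because the variance only rescales the squared distance, the per-step objective is a mean-matching loss up to a constant factor, which is consistent with the reduction to direct matching of transition means.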

These families are technically heterogeneous. Some use explicit hidden-state MSE, some use internal cue recovery without an auxiliary cue-generation loss, some use semantic rubrics, and some operate on continuous-state transition means. The literature therefore treats OPRD less as one loss function than as a design principle: move supervision into an internal or structured space while preserving on-policy state visitation.

5. Empirical behavior, efficiency, and scaling

Across the literature, the explicit OPRD formulation and its close relatives are associated with two recurring empirical claims: tighter student–teacher alignment than output-space OPD, and improved efficiency because vocabulary-sized logits need not sit on the critical update path.

Paper	Setting	Reported outcome
"OPRD: On-Policy Representation Distillation" (Yang et al., 4 Jun 2026)	Qwen-based 1.5B same-architecture distillation	Closes the student-teacher gap on AIME 2024/2025 and AIMO; trains faster and uses less memory than top-$k$ OPD
"ViCuR" (Tian et al., 4 Jun 2026)	Multimodal OPSD/OPD with Qwen3-VL-2B and 8B students	Improves overall average over both answer-based OPSD and stronger-teacher OPD for both student sizes
"Rubric-based On-policy Distillation" (Fang et al., 8 May 2026)	Black-box OPD with rubric induction and GRPO	Outperforms advanced logit-based OPD methods across most scenarios and reports a gain in sample efficiency
"DiffusionOPD" (Li et al., 14 May 2026)	Multi-task diffusion distillation	Surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance

The dedicated OPRD study is especially notable because it compares directly against output-space OPD variants under the same rollout budget and teacher forward pass. On AIME24/AIME25/AIMO, sampled-token and top-16 OPD improve the student but plateau below the teacher, whereas OPRD almost matches the teacher on all three benchmarks. In the same setup, actor-update transient memory is lower for OPRD than for top-16 OPD, and 500-step wall-clock time is lower than for the output-space OPD baselines (Yang et al., 4 Jun 2026).

Multimodal evidence points in the same direction. ViCuR’s strongest ablation result is that changing the form of privilege from answer-side hints to visual cues is the main source of improvement, and the sink-token recovery module then further improves performance by enabling the student to aggregate query-relevant visual evidence internally (Tian et al., 4 Jun 2026). This suggests that when OPRD works well, it is not merely denoising logits; it is changing what evidence the student learns to encode and where.

6. Limitations, misconceptions, and research directions

A common misconception is that OPRD is synonymous with explicit hidden-state MSE. The literature does not support that narrowing. The surveys describe OPRD more broadly as OPD in latent or representation space, and papers such as ViCuR and ROPD treat internal cue representations or prompt-specific rubrics as the distilled object even when no explicit layer-wise feature loss is introduced (Song et al., 1 Apr 2026, Zhang, 22 Jun 2026). This suggests that “representation” in OPRD is best understood functionally: it is whatever internal structure lets the student reproduce teacher-like competence on its own visited states.

Another misconception is that stronger teachers alone solve OPD. DOPD argues that privileged inputs can create privilege illusion, in which the apparent teacher advantage conflates transferable capability with information asymmetry that the student cannot reproduce at inference. Its remedy is advantage-aware routing between privileged teacher and privileged student supervision, and the paper reports consistent gains over vanilla OPD across LLM and VLM settings (Yu et al., 29 Jun 2026). ViCuR makes a closely related claim in multimodal reasoning: answer-side privilege creates train–test mismatch, whereas visual cues are designed as recoverable privilege because they are grounded in the same inference-time visual input (Tian et al., 4 Jun 2026). EDGE-OPD reaches a similar conclusion in rare-token and identity settings by restricting updates to evidence-supported positions and using guided rollouts so that the target behavior is actually present in on-policy data (Lazaridis et al., 22 May 2026).

Efficiency introduces another constraint. "Lightning OPD" shows that offline on-policy distillation can share the same optimum as standard OPD, but only under teacher consistency, meaning the same teacher must be used for both SFT and OPD; otherwise an irreducible gradient bias appears. Under teacher consistency, Lightning OPD reports its AIME 2024 result with a GPU-hour speedup over standard OPD (Wu et al., 14 Apr 2026). A plausible implication for future OPRD systems is that representation-level objectives may need analogous consistency conditions across pretraining, SFT, and on-policy phases.

Finally, competence-aware supervision is emerging as a separate but convergent theme. SEAD argues that supervision quality in OPD depends on student competence and uses entropy to gate tokens, training phases, and prompts, skipping a share of tokens while improving average accuracy over vanilla OPD across six math benchmarks (Lee et al., 26 Jun 2026). For OPRD, this suggests that uniform representation matching across all layers, tokens, and prompts may be suboptimal; future systems may need entropy-, verifier-, or competence-gated representation losses rather than blanket alignment.
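One way an entropy gate could be combined with a representation loss is sketched below. The gating direction (keep positions where student entropy exceeds a threshold) and the threshold value are assumptions chosen for illustration; they are not taken from SEAD or from the OPRD paper.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gate(student_dists, threshold):
    """Return a 0/1 mask keeping positions where student entropy exceeds threshold.

    Assumption: uncertain positions are the ones worth supervising.
    A different policy could invert or band-limit this rule.
    """
    return [1 if entropy(p) > threshold else 0 for p in student_dists]

# Hypothetical next-token distributions at three positions of a student rollout.
student_dists = [
    [0.98, 0.01, 0.01],   # confident
    [0.40, 0.30, 0.30],   # uncertain
    [1.00, 0.00, 0.00],   # fully confident
]
mask = entropy_gate(student_dists, threshold=0.5)
```

The resulting mask has the same shape as a per-position supervision mask, so it could be passed to a masked hidden-state loss in place of a fixed tail window.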

The research agenda identified by the surveys is correspondingly broad: distillation scaling laws, uncertainty-aware feedback, agent-level distillation, multimodal OPRD, adaptive support in representation space, counterfactual routing in representation manifolds, teacher recoverability estimation, and privilege compressibility all remain open (Song et al., 1 Apr 2026, Zhang, 22 Jun 2026). The field has therefore moved beyond the question of whether on-policy representation distillation is possible. The active question is which internal structures should be transferred, under which rollout distributions, with what gating and privilege design, so that teacher competence is internalized without inheriting variance, mismatch, or non-recoverable shortcuts.