On-Policy Full-Vocabulary Self-Distillation

Updated 4 June 2026
  • The paper introduces a unified paradigm where a single model acts as both student and teacher by minimizing full-vocabulary divergence over on-policy rollouts.
  • It applies divergence measures such as KL and Jensen–Shannon to the student’s own rollouts, which avoids exposure bias and provides dense per-token feedback aligned with inference behavior.
  • Empirical results show enhanced performance in mathematical reasoning, code synthesis, and multimodal QA, with improved token efficiency and stability.

On-policy full-vocabulary self-distillation (OPSD-FV) denotes a training paradigm in which a single LLM functions both as a student and as a privileged teacher, minimizing a divergence between their next-token distributions over the entire vocabulary along trajectories sampled from the student’s own on-policy rollouts. Unlike off-policy distillation, which trains on fixed demonstration prefixes, OPSD-FV eliminates distribution mismatch by providing dense, per-token, full-vocabulary feedback on the student’s own sampled sequences, typically through KL or Jensen–Shannon divergences. This results in significantly improved token efficiency, stability, and alignment with inference-time behavior in a range of domains, including mathematical reasoning, code synthesis, multimodal QA, and vision-language-action models (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Zheng et al., 26 May 2026, Yang et al., 30 May 2026, Li et al., 14 May 2026, Oh et al., 8 May 2026, Liu et al., 27 May 2026, Liu et al., 29 May 2026, Liu et al., 2 Jun 2026, Ke et al., 13 May 2026, Zhong et al., 27 Mar 2026, Hübotter et al., 28 Jan 2026, Yuan et al., 18 May 2026).

1. Conceptual Foundations and Motivation

OPSD-FV originated from the need to overcome the distribution mismatch (“exposure bias”) in off-policy supervised fine-tuning (SFT), where models are trained on demonstration prefixes but must operate on their own, potentially erroneous generations at inference. By conditioning the “student” policy on only the problem statement and its prefix and the “teacher” policy on privileged context (e.g., full reasoning trace, ground-truth answer, feedback), OPSD-FV aligns the training distribution exactly with inference by always distilling on on-policy rollouts (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026).

Key conceptual points:

  • Single-Model, Multi-Context Roles: One parameter set, π_θ, serves as both the teacher (privileged context) and student (problem-only).
  • Distribution-Matched Training: Rollouts y ∼ π_θ^S(·|x) are generated on-policy, and the distillation loss is computed between the student and teacher distributions at every prefix of these rollouts.
  • Full-Vocabulary Divergence: At each time step, supervision aligns the entire token distribution, not just the sampled token or hard argmax.

2. Mathematical Formulation of the Loss

The canonical OPSD-FV objective is a full-vocabulary divergence between the teacher and student next-token distributions, summed along the student’s own trajectory. Written in its cross-entropy form:

$$
\mathcal{L}(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi_\theta^S(\cdot\mid x)} \left[-\sum_{t=1}^{|y|} \sum_{w \in V} P_T(w\mid h_t^T) \log P_S(w\mid h_t^S)\right]
$$

where $h_t^S = f_\theta(x, y_{<t})$ is the student context and $h_t^T = f_\theta(x, y^*, y_{<t})$ is the teacher context. In practice, a full forward KL, reverse KL, or symmetric Jensen–Shannon divergence is used. Only the student branch receives gradient updates; the teacher is evaluated under stop-gradient, often with EMA weights for extra stabilization (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026).
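
As a brief worked step (a standard identity, not taken from the cited papers), the per-position cross-entropy in the objective can be related to the forward KL divergence. The symbols are those defined above; $H$ denotes Shannon entropy and is introduced here only for illustration.

```latex
\begin{aligned}
-\sum_{w \in V} P_T(w \mid h_t^T) \log P_S(w \mid h_t^S)
  &= D_{\mathrm{KL}}\bigl(P_T(\cdot \mid h_t^T) \,\|\, P_S(\cdot \mid h_t^S)\bigr)
     + H\bigl(P_T(\cdot \mid h_t^T)\bigr), \\
H\bigl(P_T(\cdot \mid h_t^T)\bigr)
  &= -\sum_{w \in V} P_T(w \mid h_t^T) \log P_T(w \mid h_t^T).
\end{aligned}
```

Because the teacher is evaluated under stop-gradient, the entropy term contributes no gradient at a fixed sampled prefix, so the cross-entropy and the forward KL yield the same gradient with respect to the student logits at that position.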

Variants appearing in the literature include:

  • Reverse-KL (mode-seeking, used in multimodal and vision-action settings) (Zhong et al., 27 Mar 2026, Yuan et al., 18 May 2026)
  • Forward-KL (mode-covering, less stable in OOD states)
  • Jensen–Shannon Divergence (symmetric soft alignment, common in most LLM applications)
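
To make the three variants concrete, here is a minimal sketch in plain Python that evaluates each divergence on a single toy next-token distribution. The function names and the three-token distributions are illustrative assumptions, and the code assumes strictly positive probabilities (as produced by a softmax); a real implementation would use batched tensor operations over the full vocabulary.

```python
import math

def forward_kl(p_teacher, p_student):
    """KL(P_T || P_S): mode-covering, weights the log-ratio by teacher mass."""
    return sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)

def reverse_kl(p_teacher, p_student):
    """KL(P_S || P_T): mode-seeking, weights the log-ratio by student mass."""
    return sum(ps * math.log(ps / pt)
               for pt, ps in zip(p_teacher, p_student) if ps > 0)

def jsd(p_teacher, p_student):
    """Jensen-Shannon divergence: average KL of each side to the 50/50 mixture."""
    mix = [0.5 * (pt + ps) for pt, ps in zip(p_teacher, p_student)]
    return 0.5 * forward_kl(p_teacher, mix) + 0.5 * forward_kl(p_student, mix)

# Toy distributions over a three-token vocabulary (assumed values).
p_teacher = [0.70, 0.20, 0.10]
p_student = [0.40, 0.40, 0.20]

print("forward KL:", forward_kl(p_teacher, p_student))
print("reverse KL:", reverse_kl(p_teacher, p_student))
print("JSD:       ", jsd(p_teacher, p_student))
```

The two KL directions generally give different values for the same pair of distributions, whereas the JSD is symmetric in its arguments and bounded above by log 2.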

3. Algorithmic Realizations and Implementation

A generic OPSD-FV training loop comprises the following steps (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Zheng et al., 26 May 2026):

  1. Data and Context Construction: For each problem x, retrieve ground-truth solution y*, feedback f, or privileged context c as available.
  2. Student Rollout: Sample a trajectory y via autoregressive generation from the student policy π_θ^S.
  3. Teacher Policy Computation: For each token position, compute teacher logits under the privileged context and student logits under the standard context.
  4. Full-Vocabulary Loss Computation: Compute per-token divergence over the full vocabulary, e.g.,

$$
D_{\mathrm{KL}}\bigl(P_T(\cdot \mid h_t^T) \,\|\, P_S(\cdot \mid h_t^S)\bigr) = \sum_{w \in V} P_T(w \mid h_t^T) \log \frac{P_T(w \mid h_t^T)}{P_S(w \mid h_t^S)}
$$

  5. Loss Aggregation and Update: Average over positions and sequences, backpropagate only through the student policy.
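
The steps above can be sketched on a toy problem. This is an illustration under heavy assumptions, not any cited paper's implementation: the student is a tabular softmax policy over a three-token vocabulary, and `toy_teacher` is a hypothetical stand-in for the privileged teacher (in OPSD-FV the teacher would be the same network evaluated with privileged context). All names and hyperparameters are invented for the example.

```python
import math
import random

VOCAB = ["a", "b", "c"]  # toy vocabulary (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """Forward KL(p || q) over the whole vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_teacher(y_star, t):
    """Hypothetical privileged teacher: 0.8 mass on the reference token at step t."""
    rest = 0.2 / (len(VOCAB) - 1)
    return [0.8 if w == y_star[t] else rest for w in VOCAB]

def opsd_step(student_logits, y_star, lr, rng):
    """One update on one on-policy rollout; returns the mean per-token KL."""
    prefix, kls = (), []
    for t in range(len(y_star)):
        # Step 3: student logits for the current prefix (created on first visit).
        z = student_logits.setdefault(prefix, [0.0] * len(VOCAB))
        p_s = softmax(z)
        p_t = toy_teacher(y_star, t)   # treated as a constant (stop-gradient)
        kls.append(kl(p_t, p_s))       # step 4: full-vocabulary divergence
        # Step 5: for softmax logits, d KL(P_T || P_S) / dz = P_S - P_T.
        # Table rows are independent, so each row is updated in place.
        for i in range(len(VOCAB)):
            z[i] -= lr * (p_s[i] - p_t[i])
        # Step 2: extend the rollout with a token sampled from the student itself.
        token = rng.choices(VOCAB, weights=p_s)[0]
        prefix += (token,)
    return sum(kls) / len(kls)

rng = random.Random(0)
student_logits = {}
history = [opsd_step(student_logits, ("a", "b"), lr=0.5, rng=rng)
           for _ in range(300)]
print(f"first step: {history[0]:.3f}  mean of last 10: {sum(history[-10:]) / 10:.3f}")
```

Only prefixes the student actually samples receive updates, which is the on-policy property the loop is meant to show; prefixes the student rarely visits are corrected correspondingly rarely.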

Optimization is typically performed using AdamW with learning rates of 1e-5 to 5e-5. Relative to standard teacher-student on-policy distillation (OPD), GPU memory is reduced by 40–60% through parameter sharing, mixed-precision evaluation of the teacher, and key/value cache reuse (Cui et al., 18 May 2026).

Recent variants include reliability weighting (e.g., MAIGO) (Zheng et al., 26 May 2026), memory-efficient top-K approximations of the full-vocabulary loss (Yuan et al., 18 May 2026), and two-stage temperature scaling for entropy management (Yang et al., 30 May 2026).
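
One generic way to approximate a full-vocabulary KL with top-K entries is sketched below. This is an assumption for illustration and not necessarily the approximation used in the cited work: it keeps the teacher's K most probable tokens and merges all remaining probability mass into a single tail bucket for both distributions. The function names and toy distributions are hypothetical.

```python
import math

def full_kl(p_t, p_s):
    """Forward KL(P_T || P_S) over every vocabulary entry."""
    return sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)

def topk_kl(p_t, p_s, k):
    """KL over the teacher's top-k tokens plus one bucket for the remaining mass."""
    order = sorted(range(len(p_t)), key=lambda i: p_t[i], reverse=True)
    top, rest = order[:k], order[k:]
    kept_t = [p_t[i] for i in top]
    kept_s = [p_s[i] for i in top]
    # Sum the discarded entries directly so an empty tail is exactly 0.0.
    tail_t = sum(p_t[i] for i in rest)
    tail_s = sum(p_s[i] for i in rest)
    return full_kl(kept_t + [tail_t], kept_s + [tail_s])

# Toy five-token distributions (assumed values, strictly positive).
p_teacher = [0.60, 0.25, 0.10, 0.04, 0.01]
p_student = [0.30, 0.30, 0.20, 0.10, 0.10]

print("full KL :", full_kl(p_teacher, p_student))
print("top-2 KL:", topk_kl(p_teacher, p_student, k=2))
```

Merging tokens into a bucket can only lose information, so this bucketed value never exceeds the full KL; it trades per-token detail in the tail for storing K + 1 probabilities per position instead of |V|.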

4. Supervision Fidelity, Stabilization, and Theoretical Issues

Supervision Fidelity Decay (SFD):

On long reasoning chains, the teacher’s predictive distribution degrades toward high entropy, which attenuates the corrective signal and allows self-reinforcing student drift (“compounding error”) (Liu et al., 29 May 2026). Theoretical analyses bound the signal-to-noise ratio in teacher log-probs given drifted student states.

Stabilization Strategies:

  • Lookahead Group Reward (LGR): Adds a one-step lookahead confidence signal to select next tokens that sustain teacher discriminability, using normalized confidence as a reward atop standard KL (Liu et al., 29 May 2026).
  • Variance Reduction: vOPD uses the analytic per-token reverse KL as a stop-gradient control-variate baseline in the policy-gradient form, which strongly reduces gradient variance without introducing bias (Oh et al., 8 May 2026).
  • Entropy-Guided Gating: EGRSD modulates the per-token distillation loss by teacher entropy, downweighting update magnitude in high-entropy teacher states while maintaining a nonzero lower bound (Ke et al., 13 May 2026).
  • Hybrid EM Schemes: VPD dynamically adapts the feedback-conditioned teacher via a variational EM loop, preventing stalling from passive or poorly calibrated teachers (Li et al., 14 May 2026).
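
The variance-reduction idea in the list above can be checked on a toy example. The sketch below is an illustration under assumptions, not the vOPD implementation: it enumerates every token of a three-token vocabulary, forms the score-function (policy-gradient) estimate of the reverse-KL gradient with respect to softmax logits, and compares the exact mean and variance with and without the analytic reverse KL as a baseline. All names and distributions are hypothetical.

```python
import math

def score_function_grads(p_s, p_t, use_baseline):
    """Per-token logit gradients of the reverse KL in policy-gradient form."""
    # log_ratio[w] = log P_S(w) - log P_T(w), the per-token cost signal.
    log_ratio = [math.log(ps / pt) for ps, pt in zip(p_s, p_t)]
    # The baseline is the analytic reverse KL, treated as a constant.
    baseline = sum(ps * r for ps, r in zip(p_s, log_ratio)) if use_baseline else 0.0
    grads = []
    for w in range(len(p_s)):
        # For softmax logits z, grad_z log P_S(w) = e_w - P_S.
        score = [(1.0 if i == w else 0.0) - p_s[i] for i in range(len(p_s))]
        grads.append([(log_ratio[w] - baseline) * s for s in score])
    return grads

def mean_and_variance(p_s, grads):
    """Exact mean and total variance of the estimator when w ~ P_S."""
    dim = len(p_s)
    mean = [sum(p_s[w] * grads[w][i] for w in range(dim)) for i in range(dim)]
    var = sum(p_s[w] * sum((grads[w][i] - mean[i]) ** 2 for i in range(dim))
              for w in range(dim))
    return mean, var

# Toy distributions (assumed values).
p_student = [0.5, 0.3, 0.2]
p_teacher = [0.7, 0.2, 0.1]

mean_plain, var_plain = mean_and_variance(
    p_student, score_function_grads(p_student, p_teacher, use_baseline=False))
mean_base, var_base = mean_and_variance(
    p_student, score_function_grads(p_student, p_teacher, use_baseline=True))

print("variance without baseline:", var_plain)
print("variance with baseline:   ", var_base)
```

The two means agree because the expected score function is zero, so subtracting a constant baseline leaves the expected gradient unchanged; on this toy pair the baseline also lowers the variance, though the size of the reduction depends on the distributions.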

Objective Selection:

Mode-seeking reverse-KL is preferred in open-ended, high-entropy or OOD settings (VLA, fine-grained vision), while the symmetric JS is common for language reasoning, as it balances drift avoidance with exploration (Zhao et al., 26 Jan 2026, Zhong et al., 27 Mar 2026, Yuan et al., 18 May 2026).

5. Applications, Model Variants, and Empirical Results

Reasoning and Mathematical Benchmarks:

  • OPSD-FV outperforms SFT and matches or exceeds strong RLVR baselines (e.g., GRPO) while being 4–8× more token efficient (Zhao et al., 26 Jan 2026).
  • On Qwen3-8B, OPSD achieves an average@16 accuracy 2 points above SFT on math competition benchmarks (AIME, HMMT) (Zhao et al., 26 Jan 2026).

Multimodal and Robotic Control:

  • Vision-OPD distills crop-conditioned privileged perception into full-image policies via the same full-vocabulary loss, yielding single-pass accuracy gains of 7–9 points over standard MLLM QA and outperforming much larger models on fine-grained tasks (Yuan et al., 18 May 2026).
  • VLA-OPD applies OPSD-FV to robotic policies, achieving robust transfer, sample efficiency gains (3×), and resilience against catastrophic forgetting compared to SFT or pure RL (Zhong et al., 27 Mar 2026).

Dialogue and Multi-Turn Scenarios:

  • MAIGO applies full-vocabulary JSD self-distillation with reliability weighting to multi-turn dialogue, using a history-cleaned teacher context (Zheng et al., 26 May 2026).

RL with Entropy Collapse:

  • TS-OPSD reheats policy entropy via self-distillation with high-temperature scaling, restoring healthy exploration dynamics and improving continued RL convergence by 0.6–1.1 points on hard math benchmarks (Yang et al., 30 May 2026).
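
The effect of temperature on entropy that this approach relies on can be seen in a few lines. The sketch below only shows how dividing logits by a temperature above 1 flattens a softmax distribution; it is not the TS-OPSD procedure, and the logits and temperatures are assumed values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature; higher values flatten the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# A sharply peaked toy distribution, standing in for a collapsed policy.
logits = [4.0, 1.0, 0.5, 0.0]

for temperature in (1.0, 2.0, 4.0):
    probs = softmax(logits, temperature)
    print(f"T={temperature}: entropy={entropy(probs):.3f}")
```

A teacher evaluated at a higher temperature therefore presents a higher-entropy target, which is the direction in which a collapsed student would be pulled by distillation.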

Stabilized Policy Optimization:

  • SDPG augments RLVR with an exact full-vocabulary reverse-KL self-distillation term, group-relative advantages, and reference-policy KL regularization. It achieves up to 9 points higher accuracy than RL baselines, with faster and more stable convergence (Liu et al., 2 Jun 2026).

Empirical results across code, reasoning, science, and multimodal benchmarks show systematic improvements in accuracy, stability, and token efficiency (Hübotter et al., 28 Jan 2026, Li et al., 14 May 2026, Ke et al., 13 May 2026).

6. Extensions, Limitations, and Future Directions

7. Representative Algorithms and Comparative Table

The following table briefly summarizes prominent OPSD-FV algorithms, their key context and supervision forms (abbreviated):

| Algorithm | Teacher Context | Supervision Type | Application Domain (References) |
|---|---|---|---|
| OPSD | Ground-truth trace | Full-vocab JSD | Reasoning, math, code (Zhao et al., 26 Jan 2026, Cui et al., 18 May 2026) |
| MAIGO | History-cleaned | Full-vocab JSD | Multi-turn dialogue (Zheng et al., 26 May 2026) |
| vOPD | Fixed teacher | Reverse-KL with baseline | Math, science (Oh et al., 8 May 2026) |
| SDPG | Privileged context | Full-vocab reverse-KL | RLVR, math (Liu et al., 2 Jun 2026) |
| Vision-OPD | Crop vs full image | Full-vocab JSD (top-K) | Multimodal QA (Yuan et al., 18 May 2026) |
| VLA-OPD | Expert teacher | Full-vocab reverse-KL | Robotic control (Zhong et al., 27 Mar 2026) |
| EGRSD | Ground-truth trace | Entropy-gated KL | LLM reasoning (Ke et al., 13 May 2026) |
| SDPO, VPD | Feedback-in-context | Full-vocab KL (EM loop) | Code, RL with feedback (Li et al., 14 May 2026, Hübotter et al., 28 Jan 2026) |

All methods share the core attributes of on-policy generation, per-token full-vocabulary divergence, and supervision from either privileged context or auxiliary information accessible only to the teacher role.


On-policy full-vocabulary self-distillation constitutes a unified, distribution-matched, and computationally efficient recipe for aligning LLMs to privileged reasoning, dense feedback, or externally enhanced perspectives on their own outputs. Ongoing research explores its scaling, stability, and hybridization with reinforcement learning and structural feedback to advance the state of the art in reasoning, multimodal understanding, and adaptive policy learning (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Liu et al., 29 May 2026, Li et al., 14 May 2026, Zheng et al., 26 May 2026, Oh et al., 8 May 2026, Liu et al., 2 Jun 2026, Liu et al., 27 May 2026, Yuan et al., 18 May 2026, Zhong et al., 27 Mar 2026, Hübotter et al., 28 Jan 2026, Yang et al., 30 May 2026, Ke et al., 13 May 2026).