On-Policy Full-Vocabulary Self-Distillation
- The paper introduces a unified paradigm where a single model acts as both student and teacher by minimizing full-vocabulary divergence over on-policy rollouts.
- It applies divergences such as KL and Jensen–Shannon to the student’s own rollouts, which counters exposure bias and provides dense per-token feedback aligned with inference behavior.
- Empirical results show enhanced performance in mathematical reasoning, code synthesis, and multimodal QA, with improved token efficiency and stability.
On-policy full-vocabulary self-distillation (OPSD-FV) denotes a training paradigm in which a single LLM functions both as a student and as a privileged teacher, minimizing a divergence between their next-token distributions over the entire vocabulary along trajectories sampled from the student’s own on-policy rollouts. Unlike off-policy distillation, which trains on fixed demonstration prefixes, OPSD-FV eliminates distribution mismatch by providing dense, per-token, full-vocabulary feedback on the student’s own sampled sequences, typically through KL or Jensen–Shannon divergences. This results in significantly improved token efficiency, stability, and alignment with inference-time behavior in a range of domains, including mathematical reasoning, code synthesis, multimodal QA, and vision-language-action models (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Zheng et al., 26 May 2026, Yang et al., 30 May 2026, Li et al., 14 May 2026, Oh et al., 8 May 2026, Liu et al., 27 May 2026, Liu et al., 29 May 2026, Liu et al., 2 Jun 2026, Ke et al., 13 May 2026, Zhong et al., 27 Mar 2026, Hübotter et al., 28 Jan 2026, Yuan et al., 18 May 2026).
1. Conceptual Foundations and Motivation
OPSD-FV originated from the need to overcome the distribution mismatch (“exposure bias”) in off-policy supervised fine-tuning (SFT), where models are trained on demonstration prefixes but must operate on their own, potentially erroneous generations at inference. By conditioning the “student” policy on only the problem statement and its prefix and the “teacher” policy on privileged context (e.g., full reasoning trace, ground-truth answer, feedback), OPSD-FV aligns the training distribution exactly with inference by always distilling on on-policy rollouts (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026).
Key conceptual points:
- Single-Model, Multi-Context Roles: One parameter set, π_θ, serves as both the teacher (privileged context) and student (problem-only).
- Distribution-Matched Training: Rollouts y ∼ π_θS(·|x) are generated on-policy. The distillation loss is computed between the student and teacher distributions on these prefixes.
- Full-Vocabulary Divergence: At each time step, supervision aligns the entire token distribution, not just the sampled token or hard argmax.
2. Mathematical Formulation of the Loss
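The canonical OPSD-FV objective is a full-vocabulary f-divergence between the teacher and student next-token distributions, summed across the student’s own trajectory:
$$\mathcal{L}(\theta) = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} D_f\!\left( \pi_\theta(\cdot \mid \tilde{s}_t) \,\big\|\, \pi_\theta(\cdot \mid s_t) \right) \right]$$
where $s_t = (x, y_{<t})$ (student context) and $\tilde{s}_t = (x, c, y_{<t})$ (teacher context), with $c$ denoting the privileged information. The argument order shown is schematic: the reverse-KL variant places the student distribution first. In practice, a forward KL, reverse KL, or symmetric Jensen–Shannon divergence is used. Only the student branch receives gradient updates; the teacher is evaluated under stop-gradient, often with EMA weights for extra stabilization (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026).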
Variants appearing in the literature include:
- Reverse-KL (mode-seeking, used in multimodal and vision-action settings) (Zhong et al., 27 Mar 2026, Yuan et al., 18 May 2026)
- Forward-KL (mode-covering, less stable in OOD states)
- Jensen–Shannon Divergence (symmetric soft alignment, common in most LLM applications)
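The three variants listed above differ only in how the teacher and student distributions are compared at a single position. The following is a minimal pure-Python sketch under stated assumptions: the function names and the toy four-token vocabulary are illustrative, and a real implementation would vectorize this over positions in a tensor library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) summed over the full vocabulary (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher, student):
    """Mode-covering: expectation is taken under the teacher."""
    return kl(teacher, student)

def reverse_kl(teacher, student):
    """Mode-seeking: expectation is taken under the student."""
    return kl(student, teacher)

def jsd(teacher, student):
    """Symmetric Jensen-Shannon divergence, bounded above by log(2)."""
    mix = [0.5 * (t + s) for t, s in zip(teacher, student)]
    return 0.5 * kl(teacher, mix) + 0.5 * kl(student, mix)

# Toy example: a peaked teacher and a uniform student over 4 tokens.
teacher = softmax([2.0, 0.5, -1.0, -1.0])
student = softmax([0.0, 0.0, 0.0, 0.0])
print(forward_kl(teacher, student), reverse_kl(teacher, student), jsd(teacher, student))
```

The two KL directions generally give different values for the same pair of distributions, whereas the Jensen–Shannon value is unchanged when the arguments are swapped.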
3. Algorithmic Realizations and Implementation
A generic OPSD-FV training loop comprises the following steps (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Zheng et al., 26 May 2026):
- Data and Context Construction: For each problem x, retrieve ground-truth solution y*, feedback f, or privileged context c as available.
- Student Rollout: Sample a trajectory via autoregressive generation from the student policy π_θS.
- Teacher Policy Computation: For each token position, compute teacher logits under the privileged context; student logits under standard context.
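- Full-Vocabulary Loss Computation: Compute the per-token divergence over the full vocabulary $V$, e.g., the forward KL $\sum_{v \in V} p_T(v) \log \frac{p_T(v)}{p_S(v)}$, where $p_T$ and $p_S$ are the teacher and student next-token distributions at that position.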
- Loss Aggregation and Update: Average over positions and sequences, backpropagate only through the student policy.
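The steps above can be sketched with a toy tabular model. This is an illustrative sketch, not the cited authors' implementation: the "model" is a hypothetical table of logits per prefix, the privileged context is simulated by a fixed logit bonus (`HINT_BONUS`) on the gold token, and the loss is forward KL, whose gradient with respect to the student logits is `p_student - p_teacher` when the teacher is held constant.

```python
import math
import random
from collections import defaultdict

VOCAB = 4         # toy vocabulary size (assumption)
HINT_BONUS = 2.0  # hypothetical stand-in for conditioning on privileged context

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rollout(table, length, rng):
    """Student rollout: sample tokens autoregressively from the student's own policy."""
    prefix = ()
    for _ in range(length):
        probs = softmax(table[prefix])
        token = rng.choices(range(VOCAB), weights=probs)[0]
        prefix += (token,)
    return prefix

def teacher_probs(table, prefix, gold_token):
    """Same parameters as the student, plus a bonus simulating the privileged context."""
    logits = list(table[prefix])
    logits[gold_token] += HINT_BONUS
    return softmax(logits)

def train_step(table, gold, rng, lr=0.5):
    """One on-policy step: roll out, score every visited prefix, update the student."""
    y = rollout(table, len(gold), rng)
    loss = 0.0
    for t in range(len(gold)):
        prefix = y[:t]                                  # prefix the student actually visited
        p_s = softmax(table[prefix])
        p_t = teacher_probs(table, prefix, gold[t])     # treated as constant (stop-gradient)
        loss += sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
        # Gradient of KL(p_t || softmax(z)) with respect to z is p_s - p_t.
        table[prefix] = [z - lr * (ps - pt) for z, ps, pt in zip(table[prefix], p_s, p_t)]
    return loss / len(gold)

rng = random.Random(0)
table = defaultdict(lambda: [0.0] * VOCAB)  # all prefixes start with uniform logits
gold = (2, 0, 3)                            # toy ground-truth solution y*
for _ in range(200):
    train_step(table, gold, rng)
print(softmax(table[()])[gold[0]])          # student probability of the first gold token
```

Only prefixes that the student itself samples receive updates, which is the on-policy property the loop is meant to illustrate; prefixes the student never visits keep their initial logits.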
Optimization is typically performed using AdamW with learning rates of 1e-5 to 5e-5. Relative to standard teacher-student OPD, GPU memory is reduced by 40–60% through parameter sharing, mixed-precision evaluation of the teacher, and key/value cache reuse (Cui et al., 18 May 2026).
Some modern variants include reliability weighting (e.g., MAIGO) (Zheng et al., 26 May 2026), memory-efficient top-K approximations for full-vocab loss (Yuan et al., 18 May 2026), and two-stage temperature scaling for entropy management (Yang et al., 30 May 2026).
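One way a top-K approximation can reduce memory is to keep the teacher's K most probable tokens and lump all remaining probability mass into a single "rest" bucket. The sketch below assumes that scheme; it is not necessarily the approximation used in the cited work, and `topk_kl` is a hypothetical name. A useful property of this coarsening is that it never overestimates the full-vocabulary KL.

```python
import math

def full_kl(p, q):
    """KL(p || q) over the full vocabulary (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def topk_kl(p_teacher, p_student, k):
    """KL over the teacher's top-k tokens plus one bucket holding the remaining mass."""
    order = sorted(range(len(p_teacher)), key=lambda i: p_teacher[i], reverse=True)
    top = order[:k]
    t_top = [p_teacher[i] for i in top]
    s_top = [p_student[i] for i in top]
    # Remaining mass outside the top-k set, clipped to guard against rounding.
    t_rest = max(0.0, 1.0 - sum(t_top))
    s_rest = max(0.0, 1.0 - sum(s_top))
    value = sum(t * math.log(t / s) for t, s in zip(t_top, s_top) if t > 0)
    if t_rest > 0 and s_rest > 0:
        value += t_rest * math.log(t_rest / s_rest)
    return value

teacher = [0.50, 0.25, 0.10, 0.08, 0.04, 0.03]
student = [0.20, 0.20, 0.20, 0.15, 0.15, 0.10]
print(topk_kl(teacher, student, 2), full_kl(teacher, student))
```

Because merging tokens into a bucket can only discard information, the bucketed value is a lower bound on the exact divergence, and it approaches the exact value as K grows toward the vocabulary size.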
4. Supervision Fidelity, Stabilization, and Theoretical Issues
Supervision Fidelity Decay (SFD):
On long reasoning chains, the teacher’s predictive distribution degrades (high entropy), attenuating corrective signal and enabling self-reinforcing student drift (“compounding error”) (Liu et al., 29 May 2026). Theoretical analyses bound the signal-to-noise ratio in teacher log-probs given drifted student states.
Stabilization Strategies:
- Lookahead Group Reward (LGR): Adds a one-step lookahead confidence signal to select next tokens that sustain teacher discriminability, using normalized confidence as a reward atop standard KL (Liu et al., 29 May 2026).
- Variance Reduction: vOPD uses the analytic per-token reverse KL as a stop-gradient control-variate baseline in the policy-gradient form, which reduces gradient variance substantially without introducing bias (Oh et al., 8 May 2026).
- Entropy-Guided Gating: EGRSD modulates the per-token distillation loss by teacher entropy, downweighting update magnitude in high-entropy teacher states while maintaining a nonzero lower bound (Ke et al., 13 May 2026).
- Hybrid EM Schemes: VPD dynamically adapts the feedback-conditioned teacher via a variational EM loop, preventing stalling from passive or poorly calibrated teachers (Li et al., 14 May 2026).
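The "no bias" property of a control-variate baseline, as described for vOPD above, can be checked exactly on a small vocabulary. The sketch below is an illustration under assumptions, not the cited method: it uses a single categorical distribution, takes gradients with respect to the student logits, and enumerates every token instead of sampling. It verifies only unbiasedness; how much the variance changes depends on the distributions involved and is not measured here.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """Analytic KL(student || teacher) over the full vocabulary."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def expected_pg(student_logits, teacher_logits, baseline):
    """Exact expectation over y ~ student of (r(y) - baseline) * d log p(y) / d logits."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    n = len(p)
    grad = [0.0] * n
    for y in range(n):
        r = math.log(p[y] / q[y])                    # per-sample log-ratio
        for i in range(n):
            score = (1.0 if i == y else 0.0) - p[i]  # score function for softmax logits
            grad[i] += p[y] * (r - baseline) * score
    return grad

def analytic_grad(student_logits, teacher_logits):
    """Closed-form gradient of KL(student || teacher): p_i * (r_i - KL)."""
    p, q = softmax(student_logits), softmax(teacher_logits)
    kl_value = reverse_kl(student_logits, teacher_logits)
    return [pi * (math.log(pi / qi) - kl_value) for pi, qi in zip(p, q)]

s_logits = [0.3, -1.2, 2.0, 0.1]
t_logits = [1.0, 0.0, -0.5, 0.7]
kl_baseline = reverse_kl(s_logits, t_logits)
print(expected_pg(s_logits, t_logits, 0.0))
print(expected_pg(s_logits, t_logits, kl_baseline))
```

Both calls print the same vector up to rounding, because the expected score function is zero and any constant baseline therefore drops out of the expectation.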
Objective Selection:
Mode-seeking reverse-KL is preferred in open-ended, high-entropy or OOD settings (VLA, fine-grained vision), while the symmetric JS is common for language reasoning, as it balances drift avoidance with exploration (Zhao et al., 26 Jan 2026, Zhong et al., 27 Mar 2026, Yuan et al., 18 May 2026).
5. Applications, Model Variants, and Empirical Results
Reasoning and Mathematical Benchmarks:
- OPSD-FV outperforms SFT and matches or exceeds strong RLVR baselines (e.g., GRPO) while being 4–8× more token efficient (Zhao et al., 26 Jan 2026).
- On Qwen3-8B, OPSD achieves average@16 accuracy 2 points above SFT in math competitions (AIME, HMMT) (Zhao et al., 26 Jan 2026).
Multimodal and Robotic Control:
- Vision-OPD distills crop-conditioned privileged perception into full-image policies via the same full-vocabulary loss, yielding single-pass accuracy gains of 7–9 points over standard MLLM QA and outperforming much larger models on fine-grained tasks (Yuan et al., 18 May 2026).
- VLA-OPD applies OPSD-FV to robotic policies, achieving robust transfer, sample efficiency gains (3×), and resilience against catastrophic forgetting compared to SFT or pure RL (Zhong et al., 27 Mar 2026).
Dialogue and Multi-Turn Scenarios:
- MAIGO employs history-cleaned teacher contexts for reducing self-contamination in multi-turn tasks, improving sharded/full accuracy ratios from 66.5% to 84.1% on Qwen2.5-7B-Instruct (Zheng et al., 26 May 2026).
RL with Entropy Collapse:
- TS-OPSD reheats policy entropy via self-distillation with high-temperature scaling, restoring healthy exploration dynamics and improving continued RL convergence by 0.6–1.1 points on hard math benchmarks (Yang et al., 30 May 2026).
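The basic mechanism behind temperature-based "reheating" is that dividing logits by a temperature above 1 flattens the distribution and raises its entropy. The sketch below shows only that effect on a toy distribution; the two-stage schedule of the cited method is not specified in this text and is not reproduced, and the names `tempered` and `entropy` are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tempered(logits, temperature):
    """Next-token distribution after dividing the logits by a temperature."""
    return softmax([z / temperature for z in logits])

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A sharply peaked (low-entropy) toy policy over 5 tokens.
logits = [6.0, 1.0, 0.5, 0.0, -1.0]
for temperature in (1.0, 2.0, 4.0):
    print(temperature, entropy(tempered(logits, temperature)))
```

Distilling toward a tempered copy of the policy would therefore pull a collapsed policy back toward higher entropy, with the uniform distribution's entropy, log of the vocabulary size, as the upper limit.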
Stabilized Policy Optimization:
- SDPG augments RLVR with an exact full-vocab reverse KL self-distillation term, group-relative advantages, and reference-policy KL regularization. SDPG achieves up to 9 points greater accuracy than RL baselines, with faster and more stable convergence (Liu et al., 2 Jun 2026).
Empirical results across code, reasoning, science, and multimodal benchmarks show systematic improvements in accuracy, stability, and token efficiency (Hübotter et al., 28 Jan 2026, Li et al., 14 May 2026, Ke et al., 13 May 2026).
6. Extensions, Limitations, and Future Directions
- Scale limitations exist: Most studies are in the 1.7–8B parameter regime. Scaling behavior beyond this is actively being investigated (Zhao et al., 26 Jan 2026).
- Teacher signal collapse (e.g., SFD) poses open problems as student policies drift out of teacher support. Methods such as lookahead, entropy gating, and active reference adaptation are under active exploration (Liu et al., 29 May 2026, Ke et al., 13 May 2026, Li et al., 14 May 2026).
- Hybrid objectives: Combining full-vocab self-distillation with outcome-level rewards, cross-entropy masking, or curriculum learning may boost both correctness rates and robustness, especially on tasks with sparse or binary feedback (Zhao et al., 26 Jan 2026, Liu et al., 2 Jun 2026).
- Internal self-distillation: OISD transfers full-vocab distillation signals from the final layer to intermediate layers, enhancing both “how to think” (logit-level) and “where to look” (attention) via advantage-weighted Jensen–Shannon losses, yielding up to 10-point gains on Qwen3-4B (Liu et al., 27 May 2026).
- Computational efficiency: Implementations exploit parameter sharing, activation offloading, top-K approximations, and optimized branching schemes to keep GPU usage tractable (Cui et al., 18 May 2026, Yuan et al., 18 May 2026, Zheng et al., 26 May 2026, Oh et al., 8 May 2026).
7. Representative Algorithms and Comparative Table
The following table briefly summarizes prominent OPSD-FV algorithms, their key context and supervision forms (abbreviated):
| Algorithm | Teacher Context | Supervision Type | Application Domain |
|---|---|---|---|
| OPSD | Ground-truth trace | Full-vocab JSD | Reasoning, math, code (Zhao et al., 26 Jan 2026, Cui et al., 18 May 2026) |
| MAIGO | History-cleaned | Full-vocab JSD | Multi-turn dialogue (Zheng et al., 26 May 2026) |
| vOPD | Fixed teacher | Reverse-KL with baseline | Math, science (Oh et al., 8 May 2026) |
| SDPG | Privileged context | Full-vocab reverse-KL | RLVR, math (Liu et al., 2 Jun 2026) |
| Vision-OPD | Crop vs full image | Full-vocab JSD (top-K) | Multimodal QA (Yuan et al., 18 May 2026) |
| VLA-OPD | Expert teacher | Full-vocab reverse-KL | Robotic control (Zhong et al., 27 Mar 2026) |
| EGRSD | Ground-truth trace | Entropy-gated KL | LLM reasoning (Ke et al., 13 May 2026) |
| SDPO, VPD | Feedback-in-context | Full-vocab KL (EM loop) | Code, RL with feedback (Li et al., 14 May 2026, Hübotter et al., 28 Jan 2026) |
All methods share the core attributes of on-policy generation, per-token full-vocabulary divergence, and supervision from either privileged context or auxiliary information accessible only to the teacher role.
On-policy full-vocabulary self-distillation constitutes a unified, distribution-matched, and computationally efficient recipe for aligning LLMs to privileged reasoning, dense feedback, or externally enhanced perspectives on their own outputs. Ongoing research explores its scaling, stability, and hybridization with reinforcement learning and structural feedback to advance the state of the art in reasoning, multimodal understanding, and adaptive policy learning (Cui et al., 18 May 2026, Zhao et al., 26 Jan 2026, Liu et al., 29 May 2026, Li et al., 14 May 2026, Zheng et al., 26 May 2026, Oh et al., 8 May 2026, Liu et al., 2 Jun 2026, Liu et al., 27 May 2026, Yuan et al., 18 May 2026, Zhong et al., 27 Mar 2026, Hübotter et al., 28 Jan 2026, Yang et al., 30 May 2026, Ke et al., 13 May 2026).