Self-Policy Distillation (SPD)

Updated 26 May 2026

Self-Policy Distillation (SPD) is a framework where a single model acts as both student and teacher by using augmented contexts to generate dense token-level training signals.
SPD improves training efficiency across various domains including language modeling, reinforcement learning, code generation, and diffusion-based image synthesis.
On-policy self-distillation (OPSD) variants demonstrate stable convergence and significant gains in low-resource settings, enabling effective transfer and model compression.

Self-Policy Distillation (SPD) is a family of learning frameworks in which a single policy network serves as both “student” and “teacher,” with supervision arising from alternate contexts or data augmentations rather than external models. SPD subsumes recent innovations across LLMs, diffusion models, reinforcement learning (RL), code generation, and planning, providing dense, token- or step-level credit assignment via self-generated privileged signals. On-policy self-distillation (OPSD) is now a central paradigm for post-training and transfer in high-resource and low-resource settings, as well as a critical mechanism for capability localization, efficient RL with verifiable rewards, and model compression.

1. Formalization and Canonical Objective

SPD is characterized by using the same model parameters for both student and teacher, with the roles differing only in input context. In the foundational variant, the student policy $\pi_\mathrm{stud}(\cdot|c_s)$ predicts outputs given a base context $c_s$ (e.g., the question alone), while the teacher policy $\pi_\mathrm{teach}(\cdot|c_t)$ predicts outputs with privileged or augmented context $c_t$ (e.g., solution trace, answer, richer context).
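As an illustration of how the two roles differ only in input context, the following minimal sketch builds a base context $c_s$ and a privileged context $c_t$ from the same question. The prompt template and the function name `build_contexts` are hypothetical and are not taken from any of the cited works.

```python
def build_contexts(question, privileged):
    """Return (c_s, c_t) for one example.

    c_s: base context seen by the student (question only).
    c_t: augmented context seen by the teacher (question plus privileged text).
    The template below is an illustrative assumption, not a published format.
    """
    c_s = f"Question: {question}\nAnswer:"
    c_t = (
        f"Question: {question}\n"
        f"Reference solution: {privileged}\n"
        "Answer:"
    )
    return c_s, c_t


c_s, c_t = build_contexts("What is 17 + 25?", "17 + 25 = 42")
```

Both strings would be fed to the same network; only the teacher pass sees the reference solution.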

The canonical OPSD objective for autoregressive models is: $\mathcal{L}_{\mathrm{OPSD}}(\theta) = \mathbb{E}_{(x,y^*) \sim \mathcal{D}} \, \mathbb{E}_{y\sim\pi_{\mathrm{stud}}(\cdot|x)} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} D\bigl( \pi_{\mathrm{teach}}(\cdot|c_t, y_{<t}) \,\|\, \pi_{\mathrm{stud}}(\cdot|c_s, y_{<t}) \bigr) \right]$ with $D(\cdot\|\cdot)$ commonly chosen as reverse KL divergence or Jensen–Shannon divergence, and where $y_{<t}$ denotes the generated tokens before position $t$ (the subscript in $c_t$ stands for "teacher", not the position index). Gradients flow only through the student branch, with the teacher acting as a stop-gradient reference (Zhao et al., 26 Jan 2026, Liu et al., 10 May 2026, Cui et al., 18 May 2026).
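To make the inner term of the objective concrete, here is a minimal sketch of the length-normalized per-token divergence for a single rollout, using only the Python standard library. It assumes the teacher and student logits at each position are already available as plain lists. The function names are illustrative, and the argument order (teacher first, student second) simply mirrors the formula as written above.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def opsd_loss(teacher_logits, student_logits, divergence=kl):
    """Average of D(teacher_t || student_t) over the positions of one rollout.

    teacher_logits[t] and student_logits[t] are the logits at position t,
    computed on the same student-sampled prefix but under c_t and c_s.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    total = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        total += divergence(softmax(t_logits), softmax(s_logits))
    return total / len(student_logits)


# Two positions, vocabulary of three tokens (made-up numbers).
teacher = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_kl = opsd_loss(teacher, student)
loss_js = opsd_loss(teacher, student, divergence=js)
```

In an actual training setup the teacher distributions would be treated as constants (stop-gradient) and the outer expectations would be estimated by averaging this quantity over a batch of student rollouts.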

This structure recurs in variants targeting model compression, transfer, RL with verifiable rewards (RLVR), capability distillation, and domains beyond language (Sun et al., 2020, Jiang et al., 6 May 2026).

2. Methodological Taxonomy and Algorithmic Forms

SPD encompasses a broad design space, unified by the self-supervised, dense feedback structure. Several key instantiations include:

Contextual OPSD in LLMs: Student generates outputs under base prompt; teacher is exposed to reference solutions or translations (as in crosslingual transfer (Liu et al., 10 May 2026), code feedback (Hübotter et al., 28 Jan 2026), or privileged CoT traces (Zhao et al., 26 Jan 2026)).
Feedback-Driven Self-Distillation: Feedback or error reports from the environment are prepended to the input during the teacher pass (SDPO) (Hübotter et al., 28 Jan 2026).
Capability-Selective Subspace Projection: A low-rank subspace capturing correctness gradients is extracted; key-value activations are projected during self-generation. Distillation occurs on subspace-filtered generations with standard language modeling loss (Hao et al., 21 May 2026).
Evolutionary Self-Distillation in Control: Noise-perturbed copies of the target policy yield higher-performing trajectories, whose actions are distilled back into the target policy by supervised regression with a squared-error loss (Sun et al., 2020).
On-Policy Self-Distillation for Diffusion Models: The student is conditioned only on text and the teacher on text + image; the distillation loss is applied to the flow velocity at each ODE step (Jiang et al., 6 May 2026).
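The evolutionary variant in the list above can be illustrated with a deliberately tiny sketch: a one-parameter linear policy is perturbed with Gaussian noise, the best-scoring perturbed copy acts as teacher, and the unperturbed policy is regressed onto its actions with a squared-error loss. The environment, the states, and all hyperparameters here are hypothetical toy choices and do not reproduce the setup of Sun et al.

```python
import random

STATES = [-1.0, -0.5, 0.5, 1.0]  # hypothetical evaluation states


def episode_return(w):
    """Hypothetical environment: reward is highest when action == 2 * state."""
    return -sum((w * s - 2.0 * s) ** 2 for s in STATES)


def distill(w, w_teacher, lr=0.5, steps=5):
    """Regress the policy's actions onto the teacher's actions with an MSE loss."""
    for _ in range(steps):
        # Gradient of mean((w*s - w_teacher*s)^2) with respect to w.
        grad = sum(2.0 * (w * s - w_teacher * s) * s for s in STATES) / len(STATES)
        w -= lr * grad
    return w


def evolve(generations=50, population=8, sigma=0.3, seed=0):
    rng = random.Random(seed)
    w = 0.0  # policy: action = w * state
    for _ in range(generations):
        # Noise-perturbed copies of the current policy; the unperturbed policy
        # is included so the selected teacher is never worse than the student.
        candidates = [w] + [w + rng.gauss(0.0, sigma) for _ in range(population)]
        best = max(candidates, key=episode_return)
        w = distill(w, best)
    return w


w_final = evolve()
```

Because the current policy is always among the candidates, the selected teacher never scores below the student in this toy, which loosely mirrors the "higher-performing trajectories" idea.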

A general algorithmic workflow for token-level SPD in LLMs takes the following steps:

For each data point, sample a student rollout on-policy (using $c_s$ ).
At each position, evaluate both student and teacher token distributions.
Accumulate per-token divergence (e.g., KL or JS).
Average over sequence and batch for the loss.
Backpropagate only through the student.
Update parameters (often through LoRA or other low-rank adapters for efficiency).
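The steps above can be sketched end to end on a toy problem. The example below is an assumption-laden illustration, not an implementation from any cited paper: the "model" is a table of bigram logits, the privileged context is modeled as a fixed logit bonus (`HINT_BONUS`) on a hypothetical reference next token, and the gradient of the KL term with respect to the student logits is written analytically instead of using an autodiff framework.

```python
import math
import random

VOCAB = 4         # toy vocabulary size (hypothetical)
HINT_BONUS = 2.0  # hypothetical logit boost standing in for the privileged context
START = 0         # hypothetical start token


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reference_next(prev):
    """Hypothetical privileged information: the 'correct' next token."""
    return (prev + 1) % VOCAB


def student_dist(theta, prev):
    return softmax(theta[prev])


def teacher_dist(theta, prev):
    # Same parameters as the student; only the context differs.
    boosted = list(theta[prev])
    boosted[reference_next(prev)] += HINT_BONUS
    return softmax(boosted)


def train(steps=200, seq_len=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * VOCAB for _ in range(VOCAB)]  # theta[prev][next] logits
    losses = []
    for _ in range(steps):
        grad = [[0.0] * VOCAB for _ in range(VOCAB)]
        loss, prev = 0.0, START
        for _ in range(seq_len):
            q = student_dist(theta, prev)  # student distribution on this prefix
            p = teacher_dist(theta, prev)  # teacher distribution, held constant
            loss += kl(p, q)               # accumulate per-token divergence
            for j in range(VOCAB):
                # d KL(p || softmax(z)) / dz_j = softmax(z)_j - p_j for constant p.
                grad[prev][j] += q[j] - p[j]
            # The next token is sampled from the student, so the rollout is on-policy.
            prev = rng.choices(range(VOCAB), weights=q, k=1)[0]
        for i in range(VOCAB):
            for j in range(VOCAB):
                theta[i][j] -= lr * grad[i][j] / seq_len  # sequence-averaged update
        losses.append(loss / seq_len)
    return theta, losses


theta, losses = train()
```

Only prefixes the student actually visits receive gradient, which is the on-policy property the workflow relies on. With a neural model, the analytic gradient would be replaced by backpropagation through the student pass only.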

For reinforcement learning variants (e.g., RLVR), this workflow is integrated with policy gradient surrogates such as PPO, with token-level divergences added as reward shaping or advantage modifications (Heakl et al., 19 May 2026, Li et al., 2 Apr 2026).
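One simple way such an advantage modification could look is sketched below: a sequence-level advantage is broadcast to every token and shifted by a scaled teacher-student log-probability gap on the sampled token. This particular formula, the coefficient `beta`, and the function name are assumptions for illustration. The cited works may combine the signals differently.

```python
def shaped_advantages(seq_advantage, teacher_logprobs, student_logprobs, beta=0.1):
    """Per-token advantages for a policy-gradient surrogate.

    seq_advantage: scalar advantage from the outcome reward (e.g., group-normalized).
    teacher_logprobs[t], student_logprobs[t]: log-probability each role assigns
        to the token actually sampled at position t.
    beta: hypothetical weight of the dense distillation term.
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-probability lists must have equal length")
    return [
        seq_advantage + beta * (lt - ls)
        for lt, ls in zip(teacher_logprobs, student_logprobs)
    ]


adv = shaped_advantages(
    seq_advantage=1.0,
    teacher_logprobs=[-0.1, -2.0, -0.3],
    student_logprobs=[-0.2, -0.5, -0.3],
    beta=0.5,
)
```

Tokens the teacher finds more likely than the student does receive a bonus, tokens it finds less likely receive a penalty, and tokens where both agree keep the plain sequence-level advantage.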

3. Empirical Results and Applications

Applications of SPD span language, code, vision, RL, and crosslingual transfer. Key quantitative outcomes include:

Context	Key Metric / Task	Baseline	SPD Variant	Improvement (Absolute/Relative)
AfriMGSM (multilingual)	Qwen3-1.7B Pass@12	9.18% (GRPO)	15.53% (COPSD)	+70% rel. (Liu et al., 10 May 2026)
Mathematical reasoning	Qwen3-8B Avg@16	51.3% (GRPO)	52.2% (OPSD)	+0.9 pp (Zhao et al., 26 Jan 2026)
Capability subspace (QA→Math)	GSM8K	11% (PSR)	26% (SPD-proj)	+15 pp (Hao et al., 21 May 2026)
RLVR Math/Tool-Use	Chemistry @5h	60% (GRPO)	70.1% (SDPO)	+10.1 pp (Hübotter et al., 28 Jan 2026)
Diffusion (VLM-guided)	Few-step image FID/IS	—	D-OPSD	Preserves few-step performance; allows continuous updates (Jiang et al., 6 May 2026)
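The table mixes absolute gains (percentage points) and relative gains (percent of baseline). As a quick check of how the two relate, the short sketch below recomputes both for the AfriMGSM row, using only the numbers given in the table. The helper name is illustrative.

```python
def improvement(baseline, variant):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_pp = variant - baseline
    rel_pct = 100.0 * abs_pp / baseline
    return abs_pp, rel_pct


# AfriMGSM, Qwen3-1.7B Pass@12: 9.18% (GRPO) vs 15.53% (COPSD)
abs_pp, rel_pct = improvement(9.18, 15.53)
```

For this row the absolute gain is about 6.35 pp and the relative gain about 69%, consistent with the rounded "+70% rel." entry.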

SPD exhibits several empirically confirmed features:

Marked improvement in low-resource and transfer settings (COPSD: up to +70% relative gain for the smallest models)
Training stability and rapid convergence (converges in 20–30 steps) (Liu et al., 10 May 2026)
Outperforms or matches RL baselines in token/sample efficiency, especially under dense feedback
Strong capability localization and out-of-domain transfer in subspace-projected SPD (Hao et al., 21 May 2026)
Preservation of few-step inference and avoidance of “collapse” issues in multimodal and diffusion domains (Jiang et al., 6 May 2026)
Reported to outperform sequence-level reward RL (GRPO) in reasoning, exploration, and sample efficiency on the cited mathematical, code, and scientific benchmarks, although margins vary from under 1 pp to roughly 10 pp (Zhao et al., 26 Jan 2026, Hübotter et al., 28 Jan 2026, Liu et al., 10 May 2026)

4. Theoretical Properties, Limitations, and Design Considerations

SPD methods benefit from several theoretically motivated properties:

Distribution Alignment: Training on student-sampled rollouts supervised by a privileged teacher directly matches train and test distributions, resolving the “exposure bias” induced by off-policy distillation (Zhao et al., 26 Jan 2026, Cui et al., 18 May 2026).
Monotonic Improvement for Selected Subtasks: In evolutionary and self-imitation settings, SPD can be interpreted as a policy-continuation procedure, guaranteeing no forgetting on previously solved subtasks (Sun et al., 2020).
Convexity under MSE or KL Objectives: The distillation loss is convex in behavioral targets for value and action regression in deterministic settings (Sun et al., 2020).

However, practical and conceptual limits are also documented:

Benefits depend on the model’s ability to rationalize or explain privileged context (yields scale-dependent gains) (Zhao et al., 26 Jan 2026).
Dense token-level SPD may induce “information leakage” if privileged information encodes the reference answer too explicitly—this is shown to degrade performance in some RLVR tasks unless addressed by contrastive or entropy-gated mechanisms (Heakl et al., 19 May 2026, Li et al., 2 Apr 2026).
In SPD variants that operate on already-correct trajectories, arbitrary trajectory matching can introduce optimization ambiguity and late-stage instability (Li et al., 2 Apr 2026).
Computational overhead is often lower than in standard knowledge distillation, but some settings still require two forward passes per token (one teacher, one student) unless logits are shared under careful memory and compute management (Cui et al., 18 May 2026).

Design mitigations include sample routing (restricting SDPO to failed rollouts), entropy-aware token weighting/gating, contrastive teacher construction, selection of lower-entropy or more reliable teacher tokens, and curriculum learning based on pass-rate or capability (Li et al., 2 Apr 2026, Ke et al., 13 May 2026, Zhang et al., 21 May 2026).
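Sample routing, the first mitigation listed, is simple to express in code. The sketch below splits a batch of rollouts so that only failed ones receive the token-level self-distillation loss, while passed ones keep just the outcome reward. The dictionary layout (`"tokens"`, `"passed"`) is a hypothetical data format chosen for illustration.

```python
def route_rollouts(rollouts):
    """Split rollouts by verifier outcome.

    Returns (distill, reward_only):
      distill     - failed rollouts, eligible for dense self-distillation
      reward_only - passed rollouts, trained with the outcome reward alone
    """
    distill, reward_only = [], []
    for rollout in rollouts:
        if rollout["passed"]:
            reward_only.append(rollout)
        else:
            distill.append(rollout)
    return distill, reward_only


batch = [
    {"tokens": [3, 1, 4], "passed": True},
    {"tokens": [1, 5, 9], "passed": False},
    {"tokens": [2, 6, 5], "passed": False},
]
distill, reward_only = route_rollouts(batch)
```

Routing in this way avoids asking already-correct trajectories to match an arbitrary teacher trajectory, which is the ambiguity noted in the limitations above.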

5. Variants, Extensions, and Emerging Trends

Recent developments broaden SPD:

Contrastive and Direction-Adaptive Distillation: Introduce a “wrong-answer” or contrastive teacher, focusing gradient magnitude only on tokens that decisively advance or obstruct correct reasoning, addressing filler versus decisive steps (Heakl et al., 19 May 2026, Zhang et al., 21 May 2026).
Capability-Selective Projections: Extract and enforce only those directions in intermediate representations most associated with correctness, suppressing spurious stylistic or model-specific artifacts—yielding high out-of-domain transfer (Hao et al., 21 May 2026).
Preference-Based and Reward-Regularized Objectives: PBSD maximizes sample-wise reward-regularized objectives rather than strict teacher matching, achieving provably better steady-state performance under reward-tilted distributions (Yu et al., 6 May 2026).
Entropy-Guided and Causal-Lookahead Gating: Attenuate or drop token-level loss on high-entropy (“uncertain”) tokens, either via linear gates or minimum lookahead to rescue transient decision pivots, thus avoiding premature supervision at branches (Ke et al., 13 May 2026).
Reasoning Compression: SPD for reasoning compression produces concise chains-of-thought without ground-truth supervision by distilling a “be concise” behavior back into the model, yielding substantial token reduction and accuracy gains (Sang et al., 5 Mar 2026).
Vision and Multimodal Applications: SPD instantiates as on-policy text-to-image distillation, aligning the student’s few-step generative path with the privileged teacher’s multimodal-conditioned trajectory, fully preserving original inference efficiency (Jiang et al., 6 May 2026).
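The entropy-guided gating item in the list above can be sketched as a per-token weight that falls linearly with entropy and reaches zero at a cutoff. The cutoff fraction, the choice of which distribution's entropy to measure, and the normalization by the sum of weights are all assumptions made for this illustration, not details taken from Ke et al.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def linear_gate(p, cutoff_frac=0.5):
    """Weight in [0, 1]: 1 at zero entropy, 0 at or above cutoff_frac * log(V)."""
    cutoff = cutoff_frac * math.log(len(p))
    return max(0.0, 1.0 - entropy(p) / cutoff)


def gated_loss(per_token_div, dists):
    """Entropy-weighted average of per-token divergences.

    per_token_div[t]: divergence at position t.
    dists[t]: distribution whose entropy drives the gate at position t
        (teacher or student; which one to use is left to the caller).
    """
    weights = [linear_gate(p) for p in dists]
    norm = sum(weights)
    if norm == 0.0:
        return 0.0  # every token was gated out
    return sum(w * d for w, d in zip(weights, per_token_div)) / norm


confident = [1.0, 0.0, 0.0, 0.0]      # zero entropy, gate = 1
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximum entropy, gate = 0
loss = gated_loss([1.0, 5.0], [confident, uncertain])
```

Here the high-entropy position is dropped entirely, so the large divergence at that branch point does not contribute to the loss.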

The field is evolving toward hybrid architectures (sample routing, multi-modal subspace, adaptive advantage signals), with key patterns of (a) leveraging privileged or compressed contexts as signals, (b) dense, trajectory-tracking supervision, and (c) modularization for stable, self-contained post-training (Cui et al., 18 May 2026).

6. Implementation and Hyperparameter Regimes

Typical SPD/OPSD training regimes follow these settings:

Parameter	Typical Value(s)
Learning Rate	$1 \times 10^{-5}$ (LLMs), $c_s$ 0 (COPSD)
Batch Size	16–64 (LLMs), 32 effective (COPSD)
LoRA rank/α	32–64 / 128
Sampling Temp.	1.1 (student), 1.0 (inference)
Divergence	Reverse KL, Jensen–Shannon ( $c_s$ 1)
Steps/Epochs	20–100 (convergence for COPSD)
Student Length	1024–4096 tokens

Distributed training is standard (8×A100/H200), and memory/compute cost is often reduced by sharing student/teacher parameters and restricting per-token teacher passes to on-policy sampled prefixes. LoRA adaptation or other low-rank fine-tuning provides parameter efficiency (Liu et al., 10 May 2026, Zhao et al., 26 Jan 2026, Cui et al., 18 May 2026).
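For readers who want these settings in one place, the following sketch collects them in a small configuration object. Field names and the validation rules are hypothetical. The defaults pick single values from the ranges listed in the table, and entries whose values are not legible in the table are deliberately omitted.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OPSDConfig:
    """Illustrative defaults drawn from the table; not an official configuration."""
    learning_rate: float = 1e-5        # LLM setting from the table
    batch_size: int = 32               # within the 16-64 range
    lora_rank: int = 32                # table lists 32-64
    lora_alpha: int = 128
    student_temperature: float = 1.1   # sampling temperature for rollouts
    eval_temperature: float = 1.0      # temperature at inference
    divergence: str = "reverse_kl"     # or "jsd"
    max_steps: int = 100               # table lists 20-100
    max_student_tokens: int = 4096     # table lists 1024-4096

    def __post_init__(self):
        if self.divergence not in ("reverse_kl", "jsd"):
            raise ValueError(f"unsupported divergence: {self.divergence}")
        if self.learning_rate <= 0 or self.batch_size <= 0:
            raise ValueError("learning_rate and batch_size must be positive")


cfg = OPSDConfig()
cfg_dict = asdict(cfg)  # plain dict, convenient for logging
```

Freezing the dataclass keeps a run's settings immutable once constructed, which makes logged configurations easier to trust.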

7. Impact, Open Directions, and Future Developments

SPD has become essential for efficient, fully self-supervised capability transfer, continual fine-tuning, and robust transfer across modalities and languages. It is central to crosslingual reasoning transfer (COPSD), reward-efficient RLVR post-training, code and reasoning compression, and self-contained capability generalization (Liu et al., 10 May 2026, Yu et al., 6 May 2026, Hao et al., 21 May 2026, Sang et al., 5 Mar 2026). Standard knowledge distillation and RL approaches are increasingly hybridized with SPD to combine outcome-aligned rewards with dense, on-policy signal.

Outstanding challenges include addressing information leakage with dense privileged signals, stabilizing late-stage training (avoiding collapse from entropy inflation or redundancy), and extending SPD methods to rapidly evolving foundation model scales and architectures. Exploratory proposals involve dynamic teacher context selection, bidirectional or contrastive self-distillation, curriculum-driven privileged augmentation, and integration with learned or retrieval-based uncertainty routing (Zhang et al., 21 May 2026, Ke et al., 13 May 2026).

A plausible implication is that as models scale and self-contextualization abilities strengthen, SPD/OPSD will be critical not only for data- or resource-efficient training, but also for modular alignment, interpretability, and adaptive curriculum in multi-domain agentic architectures.