Policy Distillation & Unification
- Policy distillation and unification are techniques that transfer and consolidate policy knowledge from expert models to smaller, efficient students across tasks.
- They use loss functions such as MSE, KL divergence, and NLL, along with reward regularization, to ensure stable training and robust performance.
- Advanced variants, including online, group, and peer distillation, offer scalable and efficient solutions for both RL and large-scale language models.
Policy distillation is a broad family of techniques for transferring the behavioral competence and knowledge of one or more trained policies (teachers) to new student policies, typically with the goals of model compression, knowledge unification across tasks, sample efficiency, or robustification. Unification refers to the consolidation of task-specific policies or diverse reasoning traces into a single agent that can handle multiple scenarios or tasks. Modern research has established a rigorous connection between policy distillation and policy optimization, with precise characterizations of the objectives, algorithmic frameworks, convergence properties, and practical outcomes across deep RL and LLMs.
1. Mathematical Foundations and Frameworks
Core policy distillation operates in the Markov decision process (MDP) formalism, aiming to make a student policy match a teacher policy as closely as possible under a divergence metric. In the classical setting—e.g., for DQN agents—one collects tuples $(s_i, \mathbf{q}_i^T)$, where $\mathbf{q}_i^T$ encodes the teacher's action-value outputs at state $s_i$, and optimizes the student using one of the following losses:
- Mean squared error (MSE): $\mathcal{L}_{\mathrm{MSE}} = \sum_i \lVert \mathbf{q}_i^S - \mathbf{q}_i^T \rVert_2^2$,
- Negative log-likelihood (NLL): $\mathcal{L}_{\mathrm{NLL}} = -\sum_i \log P(a_i = a_{i,\mathrm{best}} \mid s_i, \theta)$, with $a_{i,\mathrm{best}} = \arg\max_a \mathbf{q}_i^T[a]$,
- Softmaxed KL divergence: $\mathcal{L}_{\mathrm{KL}} = \sum_i \operatorname{softmax}(\mathbf{q}_i^T/\tau) \ln \frac{\operatorname{softmax}(\mathbf{q}_i^T/\tau)}{\operatorname{softmax}(\mathbf{q}_i^S)}$, with temperature $\tau$ to control distribution sharpness (Rusu et al., 2015).
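As a concrete illustration, the three losses above can be written in a few lines of NumPy (a minimal sketch; the function names and the assumption of raw Q-value vectors as inputs are illustrative, not taken from the cited paper):

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-scaled softmax; small tau sharpens the distribution.
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mse_loss(q_teacher, q_student):
    # Match the teacher's raw action-values directly.
    return float(np.mean((np.asarray(q_teacher) - np.asarray(q_student)) ** 2))

def nll_loss(q_teacher, student_probs):
    # Treat the teacher's greedy action as the classification label.
    best = int(np.argmax(q_teacher))
    return float(-np.log(student_probs[best]))

def kl_loss(q_teacher, q_student, tau=0.01):
    # KL(softmax(q_T / tau) || softmax(q_S)); only the teacher is sharpened.
    p = softmax(q_teacher, tau)
    q = softmax(q_student, 1.0)
    return float(np.sum(p * np.log(p / q)))
```

Note that, following the formulation above, the temperature is applied only to the teacher's values, so a low `tau` turns soft Q-values into near-one-hot targets.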
From an RL perspective, expected entropy-regularized distillation (EERD) and reverse-KL formulations generalize the objective to a dense, reward-driven iterative process, connecting an "imitation" reward to a policy-gradient update. A representative reverse-KL objective is

$$\min_\theta \; \mathbb{E}_{s \sim d_{\pi_\theta}} \big[ \mathrm{KL}\big(\pi_\theta(\cdot \mid s)\,\Vert\,\pi_T(\cdot \mid s)\big) \big],$$

which corresponds to policy-gradient ascent on the per-step imitation reward $r_t = \log \pi_T(a_t \mid s_t) - \log \pi_\theta(a_t \mid s_t)$.
Such formulations allow for sample-efficient updates, proper gradient fields, and low-variance Monte Carlo estimates (Czarnecki et al., 2019). Recent works extend the KL regularization to on-policy settings and show that classic policy distillation is a special case of KL-constrained RL (Yang et al., 12 Feb 2026).
For LLMs, the distillation analogue operates at the token level, defining f-divergence-based objectives over student-generated trajectories, with forward KL, reverse KL, and Jensen-Shannon (JSD) among common divergences. The framework is further generalized to include on-policy, off-policy, and hybrid distillation (Song et al., 1 Apr 2026).
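A minimal sketch of these token-level divergences over a single next-token distribution, assuming teacher and student probabilities are given as plain NumPy arrays (names are illustrative):

```python
import numpy as np

def _norm(p):
    # Normalize to a proper probability vector.
    p = np.asarray(p, dtype=float)
    return p / p.sum()

def kl(p, q, eps=1e-12):
    # KL(p || q), with a small epsilon to avoid log(0).
    p, q = _norm(p), _norm(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def forward_kl(teacher, student):
    # Mass-covering: the student is pushed to cover all teacher modes.
    return kl(teacher, student)

def reverse_kl(teacher, student):
    # Mode-seeking: the student concentrates on the teacher's dominant modes.
    return kl(student, teacher)

def jsd(teacher, student):
    # Jensen-Shannon divergence: a symmetric, bounded blend of the two KLs.
    t, s = _norm(teacher), _norm(student)
    m = 0.5 * (t + s)
    return 0.5 * kl(t, m) + 0.5 * kl(s, m)
```

The forward/reverse asymmetry is the practical reason the choice of divergence matters: reverse KL tolerates a student that drops low-probability teacher behavior, forward KL does not.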
2. Single-Task and Multi-Task Distillation
The canonical policy distillation process (“student–teacher transfer”) entails:
- Training a large, task-specific teacher (e.g., DQN) to convergence.
- Rolling out the teacher in environment(s) to collect transitions/replay buffers.
- Training a smaller student network to match the teacher using one of the losses above.
- (Optional) Evaluating the student in held-out scenarios with low exploration.
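The transfer loop above can be sketched under strong simplifying assumptions (a tabular student and a callable `teacher_q` oracle standing in for a trained DQN; both are hypothetical interfaces, not the paper's setup):

```python
import numpy as np

def distill_tabular(teacher_q, n_states, n_actions, lr=0.5, epochs=300, seed=0):
    """Minimal student-teacher transfer: fit a tabular student by gradient
    steps on the MSE between its action-values and the teacher's.
    `teacher_q(state)` returns the teacher's action-value vector."""
    rng = np.random.default_rng(seed)
    student = rng.normal(size=(n_states, n_actions))  # random initialization
    for _ in range(epochs):
        s = int(rng.integers(n_states))      # "roll out": sample a visited state
        target = teacher_q(s)                # query the teacher's action-values
        # Gradient step on 0.5-free MSE: d/dq ||q - target||^2 = 2 (q - target)
        student[s] -= lr * 2 * (student[s] - target)
    return student
```

With a function approximator in place of the table, the same loop becomes the canonical pipeline: roll out the teacher, store targets in a buffer, regress the student.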
Empirical results from Atari benchmarks indicate that students compressed up to 25× (a few percent of the teacher's parameters) can retain most of the teacher's performance; at more moderate compression, students can even outperform the teacher, an effect attributed to the distillation targets regularizing noisy Q-values (Rusu et al., 2015).
Policy unification extends this method to the consolidation of $n$ distinct teachers $\pi_1, \dots, \pi_n$ (each trained on a different task):
- Training a multitask student with shared backbone and per-task output heads.
- Cycling training through buffers from each teacher, applying the same distillation loss to corresponding heads.
- Achieving multi-task generalization superior to direct multi-task DQN (e.g., 116.9% geometric-mean performance vs. single-task teachers in three-game benchmarks) (Rusu et al., 2015).
This paradigm is captured in Table 1:
| Method | Teacher Count | Architecture | Aggregate Performance |
|---|---|---|---|
| Single-task | 1 | Task-specific net | 100% (baseline) |
| Multi-task DQN | none (joint RL) | Shared net, joint RL | 83.5% (3 tasks), poor |
| Multi-distill | $n$ (one per task) | Shared trunk + heads | up to 117% (3 tasks) |
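The shared-trunk, per-task-head architecture from the last row of Table 1 can be sketched at the shape level (layer sizes, names, and the single-hidden-layer trunk are illustrative choices, not the published architecture):

```python
import numpy as np

class MultiDistillStudent:
    """Multi-task student: one shared trunk, one linear head per task."""

    def __init__(self, obs_dim, hidden, action_dims, seed=0):
        rng = np.random.default_rng(seed)
        # Shared representation used by every task.
        self.W_trunk = rng.normal(scale=0.1, size=(obs_dim, hidden))
        # One output head per teacher/task, each with its own action count.
        self.heads = [rng.normal(scale=0.1, size=(hidden, a)) for a in action_dims]

    def forward(self, obs, task_id):
        h = np.tanh(obs @ self.W_trunk)   # shared trunk features
        return h @ self.heads[task_id]    # task-specific action values
```

Training cycles through per-teacher replay buffers, applying the chosen distillation loss only to the head matching the buffer's task.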
3. Advanced Variants: Online, Group, and Peer Distillation
Recent frameworks expand distillation beyond static, expert-driven teacher-student transfer:
- Online distillation (OPD): Student and teacher learning occur concurrently, with the student always distilled from the freshest teacher weights, improving both speed and compression effectiveness (Sun et al., 2019).
- Group/Peer distillation: Policies (students) transfer knowledge to each other using attention-weighted group targets or dual-advantage based weighting. Approaches like decision-attention-based OPD (Yu et al., 2024) and Dual Policy Distillation (DPD) (Lai et al., 2020) provide dense, adaptive inter-policy knowledge sharing, enabling collaborative learning with or without expert teachers.
- Distillation to interpretable policies: Neural-to-tree policy distillation transfers black-box policies to rule-based decision trees with an advantage-weighted objective, extracting compact, verifiable if-then rules (Li et al., 2021).
In all such settings, distillation acts as a regularized surrogate for more stable, sample-efficient, or interpretable policy transfers compared to pure RL.
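The dual-advantage idea behind peer distillation can be illustrated with a simplified hard-gated variant (this gating rule is an assumption for illustration, not the exact DPD objective): each policy only imitates its peer on states where the peer's value estimate is higher.

```python
import numpy as np

def dpd_targets(probs_a, probs_b, value_a, value_b, beta=1.0):
    """Advantage-gated peer targets for policy A.

    probs_a, probs_b: (n_states, n_actions) action distributions of the peers.
    value_a, value_b: (n_states,) value estimates of each peer.
    beta: interpolation strength toward the peer on gated states.
    """
    # Gate: imitate B only where B appears to do better than A.
    gate = (np.asarray(value_b) > np.asarray(value_a)).astype(float)
    g = beta * gate[:, None]
    # On gated states, move A's distribution toward B's; elsewhere keep A.
    return (1 - g) * np.asarray(probs_a) + g * np.asarray(probs_b)
```

The gate is what mitigates negative transfer: a peer's "disadvantageous" behavior never enters the imitation target.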
4. Unification with Policy Optimization and RL
Policy distillation is now rigorously unified with KL-constrained policy optimization. On-policy distillation (OPD) is shown to be a concrete instance of KL-regularized RL, with the reward at each token or step given by a weighted log-likelihood ratio between the teacher and a reference policy:

$$r_t = \alpha \log \pi_T(a_t \mid s_t) - \beta \log \pi_{\mathrm{ref}}(a_t \mid s_t),$$

where $\alpha$ and $\beta$ are trade-off parameters (Yang et al., 12 Feb 2026).
The G-OPD framework allows for:
- Reward extrapolation ($\alpha > \beta$): the student is explicitly encouraged to surpass the teacher's peaks.
- Reward correction: using the teacher's pre-RL checkpoint as $\pi_{\mathrm{ref}}$ for a more accurate signal.
- Unified treatment of RL and distillation: on-policy distillation can interpolate between imitation and exploration, yielding stable convergence and allowing for multi-domain merging (Yang et al., 12 Feb 2026, Ko et al., 11 Mar 2026).
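Under a weighted log-ratio parameterization (an assumed form for illustration, with alpha and beta as the trade-off weights), the per-token reward is a one-liner over per-token log-probabilities:

```python
import numpy as np

def gopd_token_reward(logp_teacher, logp_ref, alpha=1.0, beta=1.0):
    """Per-token distillation reward as a weighted log-likelihood ratio:
        r_t = alpha * log pi_T(a_t|s_t) - beta * log pi_ref(a_t|s_t).
    alpha = beta = 1 recovers the plain log-ratio; alpha > beta pushes the
    student beyond the teacher (reward extrapolation). Illustrative, not
    the papers' exact parameterization."""
    return alpha * np.asarray(logp_teacher) - beta * np.asarray(logp_ref)
```

Feeding this reward into any KL-regularized policy-gradient update is what makes OPD a special case of RL rather than a separate algorithm family.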
5. Large-Scale Policy Distillation in LLMs: On-Policy, Preference, and Sample Routing
For LLMs, on-policy distillation (OPD) recasts autoregressive sequence modeling as policy optimization, with the student generating its own rollouts and distillation losses computed on these outputs to mitigate exposure bias and compounding error inherent in pure behavior cloning (Song et al., 1 Apr 2026). Key axes of contemporary LLM policy distillation:
- Feedback signal: Logit-based KL, reward-model-based discrimination, or dense preference signals (e.g., odds-ratio optimization).
- Teacher access: White-box (logits), black-box (text outputs), teacher-free/self.
- Loss granularity: Token-, sequence-, or hybrid-level objectives.
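A toy sketch of one on-policy distillation step, assuming hypothetical `student_logits_fn` / `teacher_logits_fn` callables that map a token prefix to next-token logits (the key point being that the *student* generates the rollout):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float) - np.max(z)  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def opd_step(student_logits_fn, teacher_logits_fn, sample_len, rng):
    """One on-policy distillation step: sample a rollout from the student,
    then score each student-visited prefix with a token-level reverse KL
    against the teacher. Because losses are computed on the student's own
    trajectory, exposure bias from teacher-forced prefixes is avoided."""
    tokens, losses = [], []
    for _ in range(sample_len):
        p_s = softmax(student_logits_fn(tokens))
        a = rng.choice(len(p_s), p=p_s)          # student samples its own token
        p_t = softmax(teacher_logits_fn(tokens))
        # Reverse KL(student || teacher) at this position.
        losses.append(float(np.sum(p_s * np.log(p_s / p_t))))
        tokens.append(int(a))
    return tokens, float(np.mean(losses))
```

Off-policy distillation differs only in where `tokens` comes from (teacher or static data), which is exactly the axis the hybrid schemes interpolate.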
Current methods span:
- Token-level KL (white-box): e.g., DistiLLM, Generalized KD, adaptive divergence OPD.
- Preference optimization (ORPO-Distill): Odds-ratio preference over entire reasoning traces, using a mixed on-/off-policy negative generation to overcome static preference bias (Singh et al., 29 Sep 2025).
- Sample routing and self-distillation: SRPO unifies coarse sequence-level RL (group-reward PPO) and dense logit-level corrections by routing correct samples to one and failed samples to the other, further stabilized by entropy-aware weighting (Li et al., 2 Apr 2026).
- Efficient scaling and robustness: Techniques such as relaxed masking, entropy-guided token selection, and mixture-clipped rewards stabilize OPD and allow smaller models to efficiently match or exceed larger teachers in mathematical, visual, and agentic benchmarks (Ko et al., 11 Mar 2026).
Notably, on-policy and sample-routed policy distillation mechanisms empirically outperform both pure RL and classical off-policy distillation for reasoning and instruction-following, and have become central to compression and unification pipelines in industrial-scale LLM post-training (Song et al., 1 Apr 2026).
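The routing rule in SRPO-style pipelines can be caricatured as a simple partition over verified rollouts (interfaces assumed for illustration; the real method adds group rewards and entropy-aware weighting on top):

```python
def route_samples(samples, is_correct):
    """Route each rollout to one of two training branches: correct rollouts
    feed the coarse sequence-level RL branch (group-reward updates), while
    failed rollouts feed the dense logit-level distillation branch, where
    the teacher's per-token corrections carry the most signal."""
    rl_batch = [s for s, ok in zip(samples, is_correct) if ok]
    distill_batch = [s for s, ok in zip(samples, is_correct) if not ok]
    return rl_batch, distill_batch
```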
6. Policy Distillation in Multi-Agent, Lifelong, and Multi-Task Regimes
- Behavior distillation and lifelong learning: PolyTask demonstrates that modular expert training followed by offline regression-based distillation into a single conditioning-variable-driven policy can prevent catastrophic forgetting and enable seamless task addition, with competitive or superior task coverage across continual learning and robotic settings (Haldar et al., 2023).
- Multi-agent and peer distillation: Online frameworks in which multiple peer policies distill mutual knowledge preserve diversity and enable robust, decentralized knowledge propagation without requiring any single high-capacity teacher, with demonstrated gains over naive aggregation or independent training (Yu et al., 2024).
Distinct strengths of policy distillation in these regimes include compositional scalability, sample efficiency, and the preservation of adaptability under capacity constraints. Approaches are now being generalized further to accommodate heterogeneous architectures, modalities, and cross-domain policy aggregation.
7. Practical Considerations, Limitations, and Open Problems
- Stability and sample efficiency: Trust-region regularization (e.g., PPD), dynamic masking, and mixture-based reward clipping stabilize distillation versus policy collapse—crucial for deep or capacity-limited students (Spigler, 2024, Ko et al., 11 Mar 2026).
- Teacher quality and suboptimality: Frameworks like advantage-gated DPD and peer distillation target the “advantageous” knowledge of peer policies, mitigating negative transfer from imperfect teachers (Li et al., 2021, Lai et al., 2020).
- Capacity and computation: Real-time and online distillation mechanisms substantially reduce training wall-clock time, enabling very high compression with minimal performance loss (Sun et al., 2019).
- Open challenges: Scaling law characterization, uncertainty-aware distillation schedules, curriculum and adaptive feedback, unified latent-space matching for cross-architecture or cross-modal distillation, and robust, out-of-distribution evaluation remain unsolved (Song et al., 1 Apr 2026).
Policy distillation and its generalizations serve as practical, theoretically unified frameworks for efficient policy transfer, compression, multi-tasking, and robust knowledge consolidation across both RL and LLM domains, with continuing evolution as new architectures and learning paradigms emerge.