
Offline On-Policy Knowledge Distillation

Updated 6 October 2025
  • Offline on-policy knowledge distillation is a method where a fixed, pre-trained teacher model provides static outputs guiding the student model's learning.
  • It employs response-based, feature-based, and relation-based approaches to replicate teacher behaviors, enabling model compression and faster inference.
  • Recent innovations integrate self-supervision, in-context retrieval, and curriculum strategies to enhance robustness across computer vision, NLP, and offline reinforcement learning.

Offline on-policy knowledge distillation (KD) refers to the class of algorithms and frameworks in which a student model learns from a fixed, pre-trained teacher through the transfer of knowledge signals. This transfer occurs via carefully curated outputs, feature maps, relation structures, or synthetic data extracted from the teacher's policy, dataset, or reasoning traces. Unlike online or self-distillation, the teacher's policy and its outputs remain unchanged throughout the distillation process; here, "on-policy" denotes that the supervision is fixed by the pre-existing teacher behavior. Offline KD is widely adopted in computer vision, natural language processing, and offline reinforcement learning, providing significant improvements in the generalization, compressibility, and inference speed of student models.

1. Classical Paradigm and Taxonomy

Offline knowledge distillation methods are traditionally formulated as two-stage pipelines (Yang et al., 2023):

  • Stage 1: Train a high-capacity teacher on the target dataset, yielding a model with superior performance.
  • Stage 2: Freeze the teacher, then train a student model to align its outputs (logits, feature maps, or relations) with the teacher’s.

KD algorithms are typically categorized as:

  • Response-Based (“soft-target”) KD: Student mimics the teacher’s soft probability predictions via KL-divergence,

$$L_{KD} = \sum_{n=1}^N p(z^T;T)[n] \cdot \log \frac{p(z^T;T)[n]}{p(z^S;T)[n]}$$

  • Feature-Based KD: Student matches intermediate feature maps; losses can include MSE or other distance metrics,

$$L_{feature\_kd}(F^S, F^T) = L_{dis}\left(\varphi^S(F^S),\,\varphi^T(F^T)\right)$$

  • Relation-Based KD: Student reproduces higher-order correlations via similarity metrics,

$$L_{relation} = \sum_{i,j} L_{dis}\left(\psi^S(v^S_i, v^S_j),\,\psi^T(v^T_i, v^T_j)\right)$$

Offline KD treats the teacher's outputs as the fixed "policy" against which the student learns; hence the term "offline on-policy" is used in much recent work.
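For concreteness, the response-based objective above can be written as a short loss function. The following is a minimal sketch assuming a PyTorch setup; the default temperature and the $T^2$ scaling reflect common practice rather than any specific paper.

```python
# Minimal sketch of response-based (soft-target) KD; logits are placeholders.
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 4.0) -> torch.Tensor:
    """KL(p_teacher || p_student) at temperature T, averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" sums over classes and averages over the batch; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Example usage with random logits for a batch of 8 samples and 100 classes.
loss = soft_target_kd_loss(torch.randn(8, 100), torch.randn(8, 100))
```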

2. Augmented Knowledge Distillation: Hierarchical and In-Context Methods

Recent work augments classical KD forms by introducing self-supervision (Yang et al., 2021), knowledge decomposition (Zhang et al., 2021), and retrieval-based regularization (Zhu et al., 13 Jan 2025).

Hierarchical Self-Supervision

The Hierarchical Self-Supervision Augmented Knowledge Distillation (HSSAKD) framework (Yang et al., 2021) leverages auxiliary self-supervision signals (e.g., transformations of inputs) by forming a joint label space $\mathcal{N}\times\mathcal{M}$, where $|\mathcal{N}|$ is the number of classes and $|\mathcal{M}|$ is the number of transformations. Teacher and student models attach auxiliary branches at multiple depths for predicting these augmented distributions,

$$q(t_j(x); \tau) = \sigma \left(\frac{W^T\Phi(t_j(x))}{\tau} \right) \in \mathbb{R}^{N\times M}$$

The distillation loss aggregates KL divergences at multiple branches, guiding the student to absorb representational cues from various abstraction levels.
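As an illustration of the joint label space, the sketch below builds $\mathcal{N}\times\mathcal{M}$ labels under the assumption that the self-supervised transformations are the four 90-degree rotations; the constants and helper names are hypothetical and not taken from the HSSAKD implementation.

```python
# Illustrative construction of the joint (class, transformation) label space,
# assuming rotation-based self-supervision with M = 4 rotations.
import torch

NUM_CLASSES = 100   # |N|, e.g. CIFAR-100
NUM_TRANSFORMS = 4  # |M|, here 0/90/180/270-degree rotations

def joint_labels(class_labels: torch.Tensor,
                 transform_ids: torch.Tensor) -> torch.Tensor:
    """Map (class, transform) pairs into the N*M joint label space."""
    return class_labels * NUM_TRANSFORMS + transform_ids

def rotate_batch(images: torch.Tensor, transform_id: int) -> torch.Tensor:
    """Apply one of the M self-supervised transformations (90-degree rotations)."""
    return torch.rot90(images, k=transform_id, dims=(-2, -1))

# An auxiliary branch at each depth would then classify over
# NUM_CLASSES * NUM_TRANSFORMS outputs, and the teacher-student KL term
# is computed over this joint distribution at every branch.
```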

In-Context Sample Retrieval

In-Context Knowledge Distillation (IC-KD) (Zhu et al., 13 Jan 2025) introduces retrieval of similar and contrasting ("positive" and "negative") samples from a teacher feature memory bank. The PICD and NICD losses respectively align the student's outputs with aggregated teacher logits over similar samples and separate them from negatives,

$$L_{PICD} = KL(\hat{p}_i^S(\tau_1) \,\|\, p_i^S(\tau_1))$$

$$L_{NICD} = 1 - \cos(p_i^S, p_i^T) + b_{ij} \cdot \cos(p_i^S, p_j^T)$$

This method exploits inter-sample relationships rather than relying on sample-wise distillation.
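A rough sketch of an NICD-style term is given below, assuming student/teacher probability vectors and a set of retrieved negatives with weights $b_{ij}$ are already available; the function and argument names are illustrative, not the IC-KD API.

```python
# Sketch of a contrastive NICD-style term: pull toward the teacher's output for
# the same sample, push away from retrieved negative samples.
import torch
import torch.nn.functional as F

def nicd_loss(p_student: torch.Tensor,       # (C,) student probs for sample i
              p_teacher_pos: torch.Tensor,   # (C,) teacher probs for sample i
              p_teacher_negs: torch.Tensor,  # (K, C) teacher probs for K negatives
              neg_weights: torch.Tensor      # (K,) weights b_ij per negative
              ) -> torch.Tensor:
    pull = 1.0 - F.cosine_similarity(p_student, p_teacher_pos, dim=0)
    push = (neg_weights * F.cosine_similarity(
        p_student.unsqueeze(0), p_teacher_negs, dim=1)).sum()
    return pull + push
```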

3. Knowledge Quantity, Curriculum, and Progressive Distillation

Partial to Whole Knowledge Distillation (PWKD) (Zhang et al., 2021) decomposes a teacher into weight-sharing sub-networks with varying channel widths. The decomposition enables a curriculum strategy where the student gradually learns from “partial” knowledge groups (lower capacity sub-networks) progressing to “whole” knowledge. The distillation loss at each stage is a weighted combination of classification and KL divergence terms,

$$L = \alpha \cdot L_{cls}(f_t(x, W_{\rho\times}), y) + (1-\alpha) \cdot L_{kl}(f_t(x, W_{\rho\times}), f_t(x, W_{1.0\times}), T)$$

where $\rho$ is the channel width factor and $T$ the temperature. Cyclical learning rates are used to accelerate convergence at each stage.
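The sketch below illustrates the general $\alpha$-weighted classification-plus-KL shape of such a per-stage objective, assuming a hypothetical `teacher(x, width=...)` interface over the weight-sharing sub-networks; it is an outline of the staged loss form, not a reproduction of the exact PWKD training pipeline.

```python
# Sketch of one curriculum stage: distill from a "partial knowledge" sub-network
# at the current width; the curriculum sweeps width from small values up to 1.0.
import torch
import torch.nn.functional as F

def pwkd_stage_loss(student_logits, teacher, x, y, width, alpha=0.5, T=4.0):
    with torch.no_grad():
        partial_logits = teacher(x, width=width)  # sub-network targets (assumption)
    cls = F.cross_entropy(student_logits, y)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(partial_logits / T, dim=-1),
                  reduction="batchmean") * T ** 2
    return alpha * cls + (1.0 - alpha) * kd
```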

4. Dataset Distillation and Offline RL Knowledge Transfer

Offline knowledge distillation in RL includes synthesizing compact datasets and policy-level transfer mechanisms.

Dataset Distillation

Dataset Distillation for Offline RL (Light et al., 29 Jul 2024) distills a synthetic dataset $\mathcal{D}_{syn}$ by matching the gradient of a behavioral cloning loss induced by the real expert dataset,

$$L_{grad~match}(\phi\,|\,\theta) = \mathbb{E}_{\theta \sim p_\theta}\left[\, ||g_{real} - g_{syn}|| \,\right]$$

This enables training lightweight models with highly compressed data while maintaining comparable policy performance to full or percentile BC datasets.
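A simplified sketch of the gradient-matching idea follows, assuming discrete actions and omitting the expectation over network initializations; all names and shapes are illustrative rather than the paper's implementation.

```python
# Sketch of gradient matching for dataset distillation under a BC loss:
# push the gradients induced by synthetic data toward those from real data.
import torch
import torch.nn.functional as F

def grad_match_loss(policy, real_states, real_actions, syn_states, syn_actions):
    """Squared L2 distance between BC-loss gradients from real vs. synthetic data."""
    def bc_grads(states, actions, create_graph):
        # syn_actions may be class indices or soft action targets (assumption).
        loss = F.cross_entropy(policy(states), actions)
        return torch.autograd.grad(loss, list(policy.parameters()),
                                   create_graph=create_graph)

    g_real = bc_grads(real_states, real_actions, create_graph=False)
    g_syn = bc_grads(syn_states, syn_actions, create_graph=True)
    # The synthetic states/actions are the learnable quantities; gradients of this
    # loss flow back into them through create_graph=True.
    return sum((gr.detach() - gs).pow(2).sum() for gr, gs in zip(g_real, g_syn))
```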

Action-Value Weighted Decision Difference

Offline Behavior Distillation (Lei et al., 30 Oct 2024) develops Av-PBC, which weights the MSE between the distilled and near-optimal policies by their action values,

$$J(\pi^*) - J(\pi) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d_\pi}\left[\, q_{\pi^*}(s,\cdot) \cdot (\pi^*(\cdot|s) - \pi(\cdot|s))\,\right]$$

This guarantee scales linearly in the effective horizon $1/(1-\gamma)$, which is significantly tighter than the quadratic bounds of earlier PBC objectives, and improves policy efficiency and generalization.
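A minimal sketch of an action-value-weighted decision-difference loss in the spirit of Av-PBC is shown below, assuming per-state action values from a pretrained critic are available; this is not the paper's exact objective.

```python
# Sketch: weight per-action decision differences by action values, so errors on
# high-value actions dominate the distillation objective.
import torch

def av_weighted_bc_loss(policy_probs: torch.Tensor,   # (B, A) distilled policy
                        target_probs: torch.Tensor,   # (B, A) near-optimal policy
                        q_values: torch.Tensor        # (B, A) q_{pi*}(s, .)
                        ) -> torch.Tensor:
    weighted_diff = q_values * (target_probs - policy_probs)
    return weighted_diff.pow(2).sum(dim=-1).mean()
```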

5. Preference-Guided and RL-Based Offline Distillation

Offline distillation frameworks for LLMs and RL further exploit preference optimization and reward-guided learning.

Reward Model and Min-Max Optimization

Online Knowledge Distillation with Reward Guidance (Jia, 25 May 2025) frames KD as a min-max optimization between the student policy and a learned reward model (RM) trained on offline preference data. The student policy $\pi$ is optimized to minimize the worst-case gap,

$$\hat{\pi} = \arg\min_{\pi \in \Pi} \max_{r \in \mathcal{R}(\mathcal{D}_{pref})} \left[ J(\pi_E, r) - J(\pi, r) \right]$$

An extension leverages teacher Q-functions for moment-matching, supporting white-box KD.
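Structurally, the alternating scheme can be sketched as follows, assuming a hypothetical differentiable estimator `estimate_gap()` of $J(\pi_E, r) - J(\pi, r)$ with respect to both the reward model and the student policy; this is an outline of the min-max loop, not the paper's algorithm.

```python
# Highly simplified alternating min-max step: the inner step pushes the reward
# model toward the worst case consistent with the preference data, the outer
# step updates the student policy to shrink the (re-estimated) gap.
import torch

def min_max_step(policy_optimizer, reward_optimizer, estimate_gap):
    # Inner max over the reward model: ascend on the performance gap.
    reward_optimizer.zero_grad()
    (-estimate_gap()).backward()
    reward_optimizer.step()

    # Outer min over the student policy: descend on the gap.
    policy_optimizer.zero_grad()
    estimate_gap().backward()
    policy_optimizer.step()
```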

Hybrid On-Policy RL with Offline Data

Offline Data Enhanced On-Policy Policy Gradient (Zhou et al., 2023) combines on-policy actor-critic (and NPG) updates with offline Bellman-style regression, yielding doubly robust convergence: the method achieves on-policy robustness and gains sample efficiency if the offline assumptions hold. Policy updates are performed via

$$\pi^{t+1}(a|s) \propto \pi^t(a|s)\exp(\eta f^t(s,a))$$

and

$$\theta^{t+1} = \theta^t + \eta w^t$$

for parameterized policies, fusing offline critic and on-policy advantage estimation.
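For a discrete-action (tabular or per-state softmax) policy, the exponentiated update above can be written directly; the sketch below uses illustrative shapes and names.

```python
# Sketch of pi^{t+1}(a|s) ∝ pi^t(a|s) * exp(eta * f^t(s,a)) for a discrete policy.
import torch

def exponentiated_policy_update(policy_probs: torch.Tensor,  # (S, A) current policy
                                f_values: torch.Tensor,      # (S, A) critic/advantage
                                eta: float) -> torch.Tensor:
    unnormalized = policy_probs * torch.exp(eta * f_values)
    # Renormalize over actions to obtain the updated policy.
    return unnormalized / unnormalized.sum(dim=-1, keepdim=True)
```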

6. Knowledge Distillation via Contrastive Preference and Reasoning Traces

ORPO-Distill (Singh et al., 29 Sep 2025) approaches cross-architecture LLM KD by optimizing student preference for teacher reasoning traces over student errors, via an odds-ratio objective,

$$L_{ORPO} = L_{SFT} + \lambda L_{OR}$$

where

$$L_{OR} = - \log \sigma\left( \log\frac{odds_{q_\theta}(y_p|x)}{odds_{q_\theta}(y_n|x)} \right)$$

with $odds_{q_\theta}(y|x) = q_\theta(y|x)/[1-q_\theta(y|x)]$. A mixed-policy negative sampling strategy is used, probabilistically alternating between off-policy and on-policy (current checkpoint/student-generated) reasoning traces, thus mitigating distribution mismatch and improving generalization.
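A minimal sketch of the odds-ratio term is given below, assuming sequence-level probabilities $q_\theta(y|x) \in (0,1)$ for the preferred and dispreferred traces have already been computed; the epsilon terms are a numerical-stability assumption, not part of the stated objective.

```python
# Sketch of L_OR = -log sigmoid( log( odds(y_p|x) / odds(y_n|x) ) ).
import torch
import torch.nn.functional as F

def odds_ratio_loss(q_pos: torch.Tensor,   # (B,) q_theta(y_p|x)
                    q_neg: torch.Tensor,   # (B,) q_theta(y_n|x)
                    eps: float = 1e-6) -> torch.Tensor:
    odds_pos = q_pos / (1.0 - q_pos + eps)
    odds_neg = q_neg / (1.0 - q_neg + eps)
    log_odds_ratio = torch.log(odds_pos + eps) - torch.log(odds_neg + eps)
    return -F.logsigmoid(log_odds_ratio).mean()

# The full ORPO-Distill objective adds this term to the SFT loss:
# L = L_SFT + lambda * L_OR.
```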

7. Empirical Evaluation and Impact

Recent empirical studies validate that offline on-policy knowledge distillation frameworks reliably improve student performance across classification, RL, and sequential decision-making tasks (Yang et al., 2021, Zhang et al., 2021, Yang et al., 2023, Xiao et al., 2023, Zhou et al., 2023, Light et al., 29 Jul 2024, Lei et al., 30 Oct 2024, Zhu et al., 13 Jan 2025, Liu et al., 24 Feb 2025, Jia, 25 May 2025, Singh et al., 29 Sep 2025).

  • Classification Benchmarks: HSSAKD and IC-KD report accuracy gains of several percentage points on CIFAR-100 and ImageNet over baseline and competitive KD methods.
  • RL Benchmarks: RL dataset distillation leads to comparable or improved policy performance using orders-of-magnitude fewer training samples; Av-PBC demonstrates faster convergence and tighter theoretical guarantees.
  • Sequential Decision-Making: O3D improves LLM agent success rates from ~72% to ~90% in ALFWorld and outperforms baselines in WebShop.
  • Language Modeling: Path Consistency Learning and odds-ratio preference optimization methods unlock improved latency/accuracy tradeoffs not accessible to earlier speculative decoding or chain-of-thought KD.

Table: Representative Offline On-Policy KD Innovations

| Method | Key Innovation | Empirical Domain |
|--------|----------------|------------------|
| HSSAKD (Yang et al., 2021) | Hierarchical self-supervision | Vision (CIFAR, ImageNet) |
| PWKD (Zhang et al., 2021) | Progressive partial-to-whole KD | Vision (CIFAR, ImageNet) |
| IC-KD (Zhu et al., 13 Jan 2025) | In-context retrieval regularization | Vision (CIFAR, ImageNet) |
| Dataset Distillation (Light et al., 29 Jul 2024) | Gradient-matched synthetic data | RL (Procgen) |
| Av-PBC (Lei et al., 30 Oct 2024) | Action-value weighted decision diff. | RL (D4RL) |
| O3D (Xiao et al., 2023) | Offline skill discovery/distillation | LLMs, sequential decision-making |
| Reward-Guided (Jia, 25 May 2025) | Min-max over RM, Q-function reform. | LLMs, RL |
| ORPO-Distill (Singh et al., 29 Sep 2025) | Mixed-policy contrastive preference | LLMs (QA, reasoning) |

8. Practical Implications and Future Directions

Offline on-policy knowledge distillation frameworks facilitate efficient transfer of discriminative, generalizable knowledge from static teacher models or datasets to compact students. Innovations in hierarchical, curriculum-based, and contrastive preference KD address challenges of capacity mismatch, generalization, and training instability. Empirical evidence supports the adoption of multi-branch, retrieval-based, and RL-inspired objective functions for scalability and improved representation learning.

Future directions include designing adaptive distillation schedules, integrating hybrid offline-online feedback, addressing semantic mismatch in feature-based KD, and scaling preference-based objectives for multimodal and large-scale architectures. The methodological landscape spans computer vision, RL, and foundation model domains, with cross-pollination of techniques such as skill discovery, inter-sample regularization, and policy ensemble distillation.
