
On-Policy Context Distillation (OPCD)

Updated 26 February 2026
  • On-Policy Context Distillation (OPCD) is a framework that internalizes context-specific behaviors by aligning student models with context-conditioned teacher outputs using on-policy sampling and reverse KL divergence.
  • It enhances performance across tasks like mathematical reasoning, instruction distillation, and reinforcement learning while significantly reducing computational costs.
  • The algorithm involves student rollouts, token-level divergence minimization, and gradient updates to efficiently align model behavior without extensive context at inference.

On-Policy Context Distillation (OPCD) is a framework for knowledge distillation and generalization in neural sequence models—including LLMs, vision-language models (VLMs), and reinforcement learning agents—where the student model is aligned with a context-conditioned teacher on its own distribution of generated trajectories. OPCD internalizes beneficial in-context behaviors (reasoning traces, instructional prompts, or domain-specific system prompts) directly in the model parameters, eliminating the need for extensive context at inference and addressing distribution mismatch between training and test-time behavior (Ye et al., 12 Feb 2026, Zhao et al., 26 Jan 2026, Bousselham et al., 27 Oct 2025, Snell et al., 2022, Zhang et al., 16 Feb 2026).

1. Formulation and Distillation Objective

OPCD contrasts with classic “off-policy” distillation—in which the student mimics teacher outputs on a static dataset—by requiring the student to sample its own outputs during training, and then matching each prefix of those trajectories to the teacher, which may have access to additional privileged context.

Given an input $x$, optional privileged context $c$ (e.g., solution trace, system prompt), and model parameters $\theta$, let $\tau = (y_1, \ldots, y_T)$ be a student-sampled trajectory:

  • Student policy: $\pi_\theta(\cdot \mid x)$ (sees only $x$).
  • Teacher policy: $\pi_T(\cdot \mid c, x)$ (sees both $c$ and $x$).

The canonical OPCD objective, with reverse KL at each generation step, is

$$\mathcal{L}_{\mathrm{OPCD}}(\theta) = \mathbb{E}_{(x,c)\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)} \left[ \frac{1}{|y|} \sum_{t=1}^{|y|} D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x, y_{<t}) \,\middle\|\, \pi_T(\cdot \mid c, x, y_{<t}) \right) \right]$$

where $y_{<t}$ is the prefix generated by the student. The teacher can be a separate (often larger or context-enriched) model, or the same base model conditioned differently ("self-distillation") (Zhao et al., 26 Jan 2026). The token-level KL divergence is typically evaluated over logits from the full vocabulary (or a top-$k$ subset for efficiency) (Ye et al., 12 Feb 2026).
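The per-step reverse KL in this objective can be computed directly from logits. Below is a minimal plain-Python sketch with toy logit vectors standing in for real model outputs; the function names are illustrative, not from the cited papers:

```python
import math

def softmax(logits):
    """Convert a logit vector into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """D_KL(student || teacher) for one generation step over the vocabulary."""
    p = softmax(student_logits)   # student: the distribution being trained
    q = softmax(teacher_logits)   # teacher: conditioned on privileged context
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def opcd_loss(per_step_logits):
    """Token-averaged reverse KL over one student-sampled trajectory.

    per_step_logits: one (student_logits, teacher_logits) pair per generated
    token y_t, each conditioned on the student-generated prefix y_{<t}.
    """
    kls = [reverse_kl(s, t) for s, t in per_step_logits]
    return sum(kls) / len(kls)
```

In practice both logit vectors come from full model forward passes, and the loss is backpropagated only through the student; the teacher is treated as a fixed target.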

2. Algorithmic Implementation

The typical OPCD pipeline involves on-policy sampling, context construction, and token-level divergence minimization:

  1. Batch Sampling: Input–context pairs $(x, c)$ are drawn from a dataset (real or synthetic) (Ye et al., 12 Feb 2026, Snell et al., 2022).
  2. Student Rollouts: For each $x$, the student model generates output $y$ using its current parameters ($y \sim \pi_\theta(\cdot \mid x)$).
  3. Teacher Targets: The teacher computes (often without gradients) token distributions for $y$'s prefixes, conditioned on $(c, x)$.
  4. Loss Computation: At each generation step, the reverse KL between student and teacher token distributions is accumulated.
  5. Gradient Update: Parameters are updated via gradient descent on the aggregated token-level losses (Zhao et al., 26 Jan 2026, Ye et al., 12 Feb 2026).
  6. Inference Mode: The student discards $c$ and is evaluated using only the minimal input $x$ (Snell et al., 2022).

A variant, "on-policy prefix distillation," restricts token-level supervision to the first $L_{\text{train}}$ tokens per generation to accelerate training, yielding 2–47× compute savings versus full-sequence distillation with little performance loss (Zhang et al., 16 Feb 2026).

Pseudocode Outline

for (x, c) in dataloader:
    y = student.sample(x)                      # on-policy rollout; no privileged context
    loss = 0.0                                 # reset per example
    for t in range(len(y)):
        student_logits = student(x, y[:t])
        teacher_logits = teacher(c, x, y[:t])  # teacher sees privileged context c
        loss += reverse_KL(student_logits, teacher_logits)
    update(student, loss / len(y))             # token-averaged gradient step
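The prefix-distillation variant described above (supervising only the first $L_{\text{train}}$ tokens) amounts to capping the supervised horizon in this loop. A sketch under stated assumptions: `student_logits_fn` and `teacher_logits_fn` are hypothetical callables standing in for model forward passes, with the teacher one assumed to already close over the privileged context $c$:

```python
import math

def _softmax(logits):
    """Logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def _reverse_kl(student_logits, teacher_logits):
    """D_KL(student || teacher) for one generation step."""
    p, q = _softmax(student_logits), _softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prefix_distill_loss(student_logits_fn, teacher_logits_fn, y, l_train):
    """Accumulate reverse KL only over the first l_train tokens of rollout y.

    The full trajectory y is still sampled on-policy; the compute saving
    comes from only running teacher (and student) scoring passes on the
    supervised prefix rather than the whole sequence.
    """
    horizon = min(l_train, len(y))
    kls = [_reverse_kl(student_logits_fn(y[:t]), teacher_logits_fn(y[:t]))
           for t in range(horizon)]
    return sum(kls) / horizon if horizon else 0.0
```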

3. Applications and Use Cases

OPCD has been applied to a diverse set of sequence modeling scenarios:

  • Mathematical Reasoning: Leveraging solution traces as privileged context, with the teacher induced to rationalize and guide student generation steps. OPCD closes the performance gap between SFT and RL with a fraction of the computational budget (Zhao et al., 26 Jan 2026, Zhang et al., 16 Feb 2026).
  • Instruction and Prompt Internalization: Models absorb benefits of detailed prompt injections, scratchpads (“chain-of-thought”), or in-context exemplars, so that only minimal input is needed at test time (Snell et al., 2022).
  • System Prompt Distillation: Specialized system behaviors (e.g., for safety classification or domain expertise) are distilled into the student, obviating engineered prompts at deployment (Ye et al., 12 Feb 2026).
  • Vision-Language Transfer: Reasoning skills are transferred from powerful text-only LLMs to VLMs, by aligning vision-language students on their own multimodal outputs with the text-only teacher, requiring “cold-start” alignment for effective transfer (Bousselham et al., 27 Oct 2025).
  • Policy Distillation in RL: Actor distillation in on-policy settings (e.g., PPO, RLHF), though often implemented as distinct phases, aligns with the same principles (Green et al., 2019).

4. Theoretical Properties and Distillation Dynamics

OPCD operates fundamentally on the student's own distribution: this eliminates "exposure bias" intrinsic to off-policy distillation and aligns training with inference-time distributions (Ye et al., 12 Feb 2026, Zhao et al., 26 Jan 2026). The use of reverse KL ($\mathrm{KL}[\text{student} \,\|\, \text{teacher}]$) is mode-seeking: it prefers matching the high-confidence outputs of the teacher and discourages unnecessary entropy in the student (Ye et al., 12 Feb 2026).
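The mode-seeking property can be seen on a toy example: against a bimodal teacher, reverse KL favors a student that commits to one teacher mode, while forward KL favors one that spreads mass over both. All distributions below are illustrative four-token toys, not model outputs:

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy bimodal teacher over a 4-token vocabulary.
teacher = [0.49, 0.49, 0.01, 0.01]
# Mode-seeking student: commits to one teacher mode.
student_mode = [0.94, 0.02, 0.02, 0.02]
# Mass-covering student: spreads probability, including onto
# tokens where the teacher has almost no mass.
student_spread = [0.25, 0.25, 0.25, 0.25]

# Reverse KL (the OPCD objective) prefers the mode-seeking student...
rev_mode = kl(student_mode, teacher)      # ≈ 0.58
rev_spread = kl(student_spread, teacher)  # ≈ 1.27
# ...while forward KL would prefer the mass-covering one.
fwd_mode = kl(teacher, student_mode)      # ≈ 1.23
fwd_spread = kl(teacher, student_spread)  # ≈ 0.60
```

The large `rev_spread` value comes from the student placing 0.25 mass on tokens the teacher assigns 0.01, which is exactly the behavior reverse KL penalizes.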

If self-distillation is used, stability depends on decoupling updates: e.g., freezing the teacher parameters or leveraging nontrivial context differentials between teacher and student (Zhao et al., 26 Jan 2026). For cross-model or cross-modality distillation (e.g., LLM→VLM), cold-start supervised fine-tuning is critical for alignment, as unaligned rollouts can yield uninformative KL gradients (Bousselham et al., 27 Oct 2025).

Variants with partial-context supervision or reward-masked distillation (distill only on failures) balance exploration and imitation in RL setups (Bousselham et al., 27 Oct 2025, Zhang et al., 16 Feb 2026).
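Reward-masked distillation can be sketched as follows; the `(per_token_kls, reward)` trajectory format and the function name are illustrative conventions, not from the cited papers:

```python
def masked_distill_loss(trajectories, distill_on_success=False):
    """Reward-masked OPCD: accumulate KL only on failed rollouts.

    trajectories: list of (per_token_kls, reward) pairs, where reward is 1
    for a successful rollout and 0 for a failed one. Successful rollouts
    are skipped by default, so the student's own working strategies are
    not overwritten by imitation and exploration is preserved.
    """
    losses = []
    for token_kls, reward in trajectories:
        if reward == 0 or distill_on_success:
            losses.append(sum(token_kls) / len(token_kls))
    return sum(losses) / len(losses) if losses else 0.0
```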

5. Empirical Gains and Benchmarking

Across a wide empirical spectrum, OPCD yields state-of-the-art or highly competitive performance, with significant efficiency and generalization advantages:

| Method / Task | Core Baseline | With OPCD | Compute Benefit |
| --- | --- | --- | --- |
| Math reasoning, Qwen3-4B (Zhao et al., 26 Jan 2026) | 49.6% (GRPO) | 50.6% | 4–8× tokens |
| Math reasoning, Qwen3-8B (Zhang et al., 16 Feb 2026) | 23.3% (Full) | 30.8% | 47× FLOPs (prefix) |
| Instruction distill (Snell et al., 2022) | 43.4 Rouge-L | 34.7 (post) | 11× fewer tokens |
| Safety classification (Ye et al., 12 Feb 2026) | 77.2 (CD) | 79.6 | |
| VLM transfer (MMMU-Pro) (Bousselham et al., 27 Oct 2025) | 27.1% | 32.0% | |

This efficiency does not come at the cost of generalization: out-of-distribution capabilities are notably preserved or improved compared to off-policy or context-heavy approaches (Ye et al., 12 Feb 2026, Zhang et al., 16 Feb 2026).

6. Variations, Ablations, and Best Practices

Empirical ablations establish:

  • Scale Effects: Larger models benefit more from OPCD, with gains increasing with capacity (Zhao et al., 26 Jan 2026).
  • Prefix Truncation: Shorter distillation prefixes speed up training, but prefix-length scheduling is essential to avoid tail degeneration, especially in compact students (Zhang et al., 16 Feb 2026).
  • Reward-Guided Masking: In RL, KL divergence applied only to failed trajectories avoids over-imitation and supports exploration (Bousselham et al., 27 Oct 2025).
  • Self- vs Teacher-Student Distillation: Teacher-student OPCD, especially with a frozen and larger teacher, outperforms purely endogenous (self-distill) setups on most tasks (Ye et al., 12 Feb 2026).
  • Practical Stabilizers: Teacher-forcing, prompt engineering, mixing with original instruction-tuning data, and reducing KL support to top-$k$ tokens improve stability and efficiency (Snell et al., 2022, Ye et al., 12 Feb 2026).
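The top-$k$ KL-support reduction mentioned above can be sketched like this; renormalizing both distributions over the teacher's top-$k$ support is one common convention, and the cited papers may differ in details:

```python
import math

def topk_reverse_kl(student_logits, teacher_logits, k):
    """Reverse KL restricted to the teacher's top-k tokens.

    Both distributions are renormalized over the truncated support, so
    only k (rather than vocabulary-size) probabilities are materialized.
    """
    # Indices of the k highest-scoring teacher tokens.
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]

    def softmax_on(logits, idx):
        """Softmax computed only over the index set idx."""
        m = max(logits[i] for i in idx)
        exps = {i: math.exp(logits[i] - m) for i in idx}
        z = sum(exps.values())
        return {i: v / z for i, v in exps.items()}

    p = softmax_on(student_logits, top)
    q = softmax_on(teacher_logits, top)
    return sum(p[i] * math.log(p[i] / q[i]) for i in top if p[i] > 0)
```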

7. Limitations and Practical Considerations

OPCD requires the teacher and student policies to share a compatible vocabulary or tokenizer. Insufficient initial alignment (“cold-start”) can yield flat, uninformative teacher distributions that slow learning or destabilize gradients, especially when teacher and student are heterogeneous (e.g., LLM–VLM) (Bousselham et al., 27 Oct 2025). Truncated prefix distillation, if too aggressive, may under-train long-horizon reasoning in small models (Zhang et al., 16 Feb 2026).

Optimally configuring rollout length, reward masking, and reverse KL computation is necessary to balance compute, convergence, and generalization. Over-imitating the teacher may suppress emergent strategies unique to the student; exploration-aware masking mitigates this (Bousselham et al., 27 Oct 2025).


