On-Policy Context Distillation for Language Models
This presentation introduces On-Policy Context Distillation (OPCD), a fundamental reimagining of how language models internalize contextual knowledge. Rather than relying on costly in-context prompts at runtime, OPCD compresses knowledge directly into model parameters through on-policy learning and reverse KL minimization. This approach eliminates exposure bias, mitigates catastrophic forgetting, and enables persistent accumulation of experiential knowledge—transforming how models adapt and learn from context without inference-time overhead.

Script
What if a language model could carry its context in its weights instead of its prompt? Every in-context prompt you feed a model at runtime costs memory, latency, and token budget. The researchers behind this work asked: what if we could compress that context directly into the model's parameters, but do it right, without the pitfalls of traditional distillation?
Traditional context distillation treats a context-conditioned model as a teacher and tries to offload that context into a student's weights. But conventional approaches use off-policy learning with forward KL divergence, which creates a mismatch between training and inference and leads to brittle, averaged-out behaviors that fail to generalize.
On-Policy Context Distillation flips this paradigm on its head.
The method constructs a student model that generates text without seeing the context, while a teacher model receives both context and input. The key innovation is reverse KL minimization computed along the student's own trajectories. This ensures the student learns to match the teacher's distribution only within the subspace it actually explores during generation, creating sharp, teacher-aligned behavior without exposure mismatch.
The architecture is elegantly simple. The student produces completions using only the task prompt. At each token position along that student-generated trajectory, we compute the reverse KL between the student's next-token distribution and the teacher's distribution conditioned on the full context. This gradient signal trains the student to internalize what the teacher knows, but only along paths the student would naturally take.
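The per-token objective described above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the distributions below are made-up stand-ins for the student's context-free next-token distribution and the teacher's context-conditioned one at each position of a student-generated trajectory.

```python
import numpy as np

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): mode-seeking; it penalizes the student
    for placing mass where the teacher assigns little probability."""
    s = np.clip(student_probs, eps, 1.0)
    t = np.clip(teacher_probs, eps, 1.0)
    return float(np.sum(s * np.log(s / t)))

# Toy trajectory over a 3-token vocabulary. At each position sampled by
# the student, we compare its distribution against the teacher's
# (which sees the full context). Values are purely illustrative.
student_traj = [np.array([0.7, 0.2, 0.1]),
                np.array([0.5, 0.4, 0.1])]
teacher_traj = [np.array([0.6, 0.3, 0.1]),
                np.array([0.45, 0.45, 0.1])]

# The training signal is the reverse KL summed along the student's own
# trajectory; it is zero exactly when the two distributions match.
loss = sum(reverse_kl(s, t) for s, t in zip(student_traj, teacher_traj))
```

Because the positions come from the student's own rollout, the gradient only pulls the student toward the teacher on states the student actually visits, which is what avoids the train-inference mismatch.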
When you compare the two approaches side by side, the difference is stark. Off-policy distillation with forward KL trains the student on teacher-generated text, creating a fundamental mismatch between how the model is trained and how it will be used. It also encourages mode-covering behavior, averaging over multiple possible outputs. OPCD trains on the student's own outputs, eliminating exposure bias, and uses reverse KL to seek out the teacher's preferred modes without dilution.
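The mode-covering versus mode-seeking distinction can be seen on a toy example. Here a bimodal "teacher" is compared against two hypothetical students: one that commits to a single mode and one that averages mass between the modes (all distributions are invented for illustration).

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats over a discrete distribution."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

teacher  = np.array([0.48, 0.04, 0.48])  # two sharp modes
one_mode = np.array([0.90, 0.05, 0.05])  # commits to a single mode
averaged = np.array([0.25, 0.50, 0.25])  # spreads mass between modes

# Forward KL(teacher || student) punishes the committed student,
# because teacher mass at the second mode has nowhere to go...
fwd_one = kl(teacher, one_mode)
fwd_avg = kl(teacher, averaged)

# ...while reverse KL(student || teacher) punishes the averaged student
# for putting mass where the teacher has almost none.
rev_one = kl(one_mode, teacher)
rev_avg = kl(averaged, teacher)
```

On these numbers the forward direction scores the diluted, averaged student better, while the reverse direction prefers the student that commits to one of the teacher's modes, which is exactly the behavior the comparison above describes.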
The most striking application is experiential knowledge distillation. As models solve sequential problems in math reasoning or text-based games, they generate traces that encode high-level strategies. OPCD consolidates these traces into parameters, and validation accuracy climbs in stair-step fashion with each round of knowledge aggregation, demonstrating genuine learning and transfer without requiring the context at test time.
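The round-by-round aggregation can be outlined as a simple outer loop. Everything here is a hypothetical stub (`collect_traces`, `distill_round`, the toy "parameters") meant only to show the control flow, not the paper's training code.

```python
def collect_traces(student, problems):
    """Stub: the current student attempts each problem; its solution
    traces encode the high-level strategies to be consolidated."""
    return [f"trace-for-{p}" for p in problems]

def distill_round(student, traces):
    """Stub standing in for one round of on-policy reverse-KL
    distillation that folds the traces into the student's parameters."""
    return student + [len(traces)]  # toy 'parameter update'

student = []                      # toy stand-in for model parameters
problems = ["p1", "p2", "p3"]
for round_idx in range(3):        # each round aggregates new experience
    traces = collect_traces(student, problems)
    student = distill_round(student, traces)
```

Each pass through the loop corresponds to one round of knowledge aggregation: solve with the current student, distill the resulting traces, then redeploy the updated student for the next round, with no context needed at test time.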
The robustness advantage becomes visible when distilling system prompts for specialized tasks like safety classification. After distilling a safety prompt, OPCD maintains strong performance on both the safety task and on unrelated out-of-distribution medical questions. Off-policy distillation, by contrast, suffers catastrophic forgetting on the medical task. The reverse KL objective prevents the broad mode-covering that overwrites previously learned knowledge.
OPCD opens the door to models that continuously learn from their own deployment. Imagine a system that accumulates experiential traces in production, periodically distills them into a student model, and deploys that student for efficient, context-free inference. The method scales across domains, from reasoning strategies to safety rules, and lays theoretical groundwork by aligning context distillation with policy optimization principles from reinforcement learning.
On-Policy Context Distillation redefines how language models internalize knowledge: not by imitating a teacher's outputs off-policy, but by aligning their own generation process to the teacher's distribution on-policy. Visit EmergentMind.com to explore the full paper and see how this method transforms persistent adaptation in large language models.