Self-Distilled Reasoner: Teaching Models to Learn from Themselves

This presentation explores a novel approach to improving large language model reasoning through on-policy self-distillation. The method uses a single model as both teacher and student, leveraging privileged information during training to provide dense token-level supervision without requiring external teacher models or expensive sampling procedures.
Script
What if a language model could teach itself to reason better by acting as both student and teacher? This fascinating possibility drives today's paper on self-distilled reasoners for large language models.
Let's start by understanding the core challenge the authors tackle.
Existing methods for improving reasoning face significant limitations. Reinforcement learning approaches such as GRPO must sample 8 responses per problem, and they often receive zero learning signal when every sampled answer is wrong.
This motivates the central research question: can we eliminate the need for separate teacher models while maintaining the benefits of on-policy training? The authors propose an elegant solution: on-policy self-distillation, or OPSD.
Now let's explore their innovative approach to self-distillation.
The key insight is using the same model parameters in two different contexts. The student policy sees only the problem, just like during inference, while the teacher policy gets to see the ground-truth solution and provides guidance.
The training process is remarkably efficient compared to alternatives. For each problem, the student generates just one response, and the teacher provides dense feedback at every token position using the same underlying model.
Let's dive into the mathematical foundation that makes this approach work.
The mathematical foundation uses the generalized Jensen-Shannon divergence to compare student and teacher token distributions. Critically, gradients flow only through the student logits; the teacher distribution is treated as a fixed target.
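To make this concrete, here is a minimal NumPy sketch of a generalized Jensen-Shannon divergence between per-token distributions. The interpolation coefficient `beta` and the function names are illustrative assumptions, not the paper's code; in real training, autograd would propagate gradients only through the student logits, with the teacher side detached.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def generalized_jsd(student_logits, teacher_logits, beta=0.5, eps=1e-12):
    """Generalized Jensen-Shannon divergence at each token position.

    Illustrative sketch: in training, only student_logits would receive
    gradients; the teacher distribution acts as a fixed target.
    """
    p = softmax(student_logits)          # student token distribution
    q = softmax(teacher_logits)          # teacher token distribution (fixed)
    m = beta * p + (1.0 - beta) * q      # mixture distribution
    kl = lambda a, b: (a * (np.log(a + eps) - np.log(b + eps))).sum(axis=-1)
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)
```

When student and teacher agree, the divergence is zero at every position; disagreement yields a positive, dense per-token loss.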
Implementation-wise, the teacher policy gets a special prompt that includes the reference solution and asks it to rationalize the student's attempt. Using the initial policy as the teacher provides crucial training stability.
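As a concrete illustration, a privileged teacher prompt of the kind just described might be assembled like this. The template wording and field names are hypothetical assumptions; the paper's actual prompt may differ.

```python
def build_teacher_prompt(problem: str, reference_solution: str,
                         student_attempt: str) -> str:
    """Assemble a privileged prompt for the teacher policy.

    The teacher sees the reference solution (information the student
    never sees) and is asked to rationalize the student's attempt.
    Hypothetical template, for illustration only.
    """
    return (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        "A student produced the attempt below. Rationalize it step by "
        "step, guiding each token toward the reference reasoning.\n\n"
        f"Student attempt:\n{student_attempt}\n"
    )
```

The student policy, by contrast, would be conditioned on the problem text alone, exactly as at inference time.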
Now let's examine how this approach performs in practice.
The authors tested their approach on challenging math reasoning benchmarks using Qwen models of different sizes. They compared against established baselines including GRPO reinforcement learning and standard supervised fine-tuning.
The results show consistent improvements across model sizes, with the 8B model improving by more than 2 percentage points. Remarkably, OPSD matches or beats GRPO while being far more efficient.
Perhaps most impressive is the efficiency gain: OPSD achieves better results while using 4 to 8 times fewer tokens than GRPO. This dramatic reduction in computational cost makes the approach much more practical.
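The sampling side of this saving follows from back-of-the-envelope arithmetic: GRPO samples 8 responses per problem while OPSD samples one. The average response length below is an illustrative assumption, and this sketch ignores the cost of the teacher's extra forward passes.

```python
def sampled_tokens_per_problem(num_responses: int, avg_response_len: int) -> int:
    # Tokens the policy must generate for one training problem.
    return num_responses * avg_response_len

AVG_LEN = 512                                    # illustrative response length
grpo = sampled_tokens_per_problem(8, AVG_LEN)    # GRPO: 8 sampled responses
opsd = sampled_tokens_per_problem(1, AVG_LEN)    # OPSD: one student response
print(grpo // opsd)  # 8
```

Sampling alone accounts for an 8x gap here; the paper's measured 4-8x figure presumably reflects the full training pipeline, not just generation.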
Several important insights emerge from their detailed analysis.
The authors discovered that self-distillation effectiveness scales with model size, requiring sufficient capacity for the teacher to meaningfully rationalize student attempts. Full vocabulary distillation also outperforms more targeted approaches.
This comparison highlights why OPSD represents a significant advancement: it achieves the benefits of on-policy distillation without the computational overhead of separate teacher models or expensive sampling procedures.
The authors also honestly acknowledge several important limitations.
The authors acknowledge that their experiments only go up to 8 billion parameters, and it's unclear whether the benefits persist at much larger scales. The method also doesn't explicitly verify answer correctness during training.
Future work could address these limitations by testing on larger models, incorporating answer verification, and using curriculum learning to gradually increase problem difficulty as the model improves.
Let's consider what this research means for the broader field.
This work has broader implications beyond just improving math reasoning. By making advanced training techniques more computationally accessible, it democratizes access to state-of-the-art reasoning improvements for researchers with limited resources.
The self-distilled reasoner represents an elegant solution to a fundamental challenge: how to provide rich supervision without expensive computation. This work shows that sometimes the best teacher is yourself, given the right context. Visit EmergentMind.com to explore more cutting-edge research in AI and machine learning.