REOPOLD: Efficient Policy Distillation
- REOPOLD is a training framework that transfers reasoning capabilities from a high-capacity teacher to a smaller student model with improved training stability and sample efficiency.
- It employs mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement strategy to mitigate instability and negative transfer.
- Empirical results demonstrate up to 12× sample efficiency improvements and a 3.32× inference speedup across mathematical, visual, and agentic reasoning tasks.
REOPOLD (Relaxed On-Policy Distillation) is a training framework for efficiently transferring reasoning capabilities from a high-capacity teacher model to a smaller student model in the context of sequence modeling and decision-making. REOPOLD addresses the instability and negative transfer present in conventional on-policy distillation by transforming the process into a stabilized policy optimization routine that leverages three central mechanisms: mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD demonstrates substantially improved sample efficiency and inference-time scaling on mathematical, visual, and agentic tool-use reasoning tasks, achieving up to 12× sample efficiency improvements and 3.32× inference speedup over prior baselines on standard benchmarks (Ko et al., 11 Mar 2026).
1. Theoretical Foundations of On-Policy Distillation
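On-policy distillation, as formalized in REOPOLD, minimizes the reverse Kullback-Leibler (RKL) divergence from the student policy $\pi_\theta$ to the teacher policy $\pi_T$, evaluated on trajectories generated by the student itself:
$$\mathcal{L}_{\mathrm{RKL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})} \right]$$
Minimizing this loss is equivalent to a policy-gradient approach where each generated token receives as reward the log-likelihood ratio:
$$r_t = \log \frac{\pi_T(y_t \mid x, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}$$
Generalizing to training with importance-weighted sampling from a previous policy $\pi_{\theta_{\mathrm{old}}}$, the objective to maximize becomes:
$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)} \left[ \sum_{t=1}^{|y|} \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})} \, \mathrm{sg}[r_t] \right]$$
Using a "stop-gradient" $\mathrm{sg}[\cdot]$ on $r_t$ reduces the variance of the gradient estimator without biasing its expectation, empirically stabilizing training dynamics. The estimator stays unbiased because the dropped term is proportional to $\mathbb{E}_{y_t \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})]$, which is zero.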
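As a rough illustration of the token-level reward and the importance-weighted objective described above, the following sketch computes both from per-token log-probabilities. The function names and the toy probabilities are assumptions made for this example, not code from the paper. Rewards are plain floats here, so they act as constants; in an autodiff framework the same effect would come from an explicit stop-gradient on the reward.

```python
import math


def token_rewards(teacher_logp, student_logp):
    """Per-token reward: log pi_T(y_t) - log pi_theta(y_t)."""
    return [lt - ls for lt, ls in zip(teacher_logp, student_logp)]


def surrogate(student_logp, old_logp, rewards):
    """Importance-weighted sum: sum_t exp(logp_theta - logp_old) * r_t.

    The rewards are treated as fixed numbers (no gradient flows through them).
    """
    return sum(
        math.exp(ls - lo) * r
        for ls, lo, r in zip(student_logp, old_logp, rewards)
    )


# Toy sequence of three tokens (probabilities are made up for illustration).
teacher_logp = [math.log(0.5), math.log(0.1), math.log(0.8)]
student_logp = [math.log(0.25), math.log(0.4), math.log(0.8)]

rewards = token_rewards(teacher_logp, student_logp)

# Fully on-policy case: the sampling policy equals the current policy,
# so every importance ratio is 1 and the surrogate is the plain reward sum.
value = surrogate(student_logp, student_logp, rewards)
```

The second token, which the student prefers far more than the teacher does, receives a negative reward; the third, where both agree, receives zero.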
2. Core Mechanisms of the REOPOLD Framework
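REOPOLD augments vanilla on-policy RKL objectives via selective and tempered token-level reward processing. Up to normalization, the principal objective is a masked, importance-weighted sum of clipped token rewards:
$$\mathcal{J}_{\mathrm{REOPOLD}}(\theta) \propto \mathbb{E}\left[ \sum_{i} \sum_{t} M^{(k)}_{i,t} \, \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \, \mathrm{sg}[\tilde{r}_{i,t}] \right]$$
with $i$ indexing sampled responses, $t$ tokens, $k$ the current training step, $\tilde{r}_{i,t}$ the clipped token reward, and $M^{(k)}_{i,t} \in \{0, 1\}$ indicating eligibility for gradient update at each token.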
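To make the roles of the response index, token index, and mask concrete, here is a minimal sketch that evaluates a masked, importance-weighted objective over a small batch. The function name `masked_objective`, the nested-list layout, and the choice to average over all tokens are assumptions for illustration; the source does not state the normalization.

```python
def masked_objective(ratios, rewards, mask):
    """Average of mask * ratio * reward over every token of every response.

    ratios, rewards, mask: lists of per-response lists (ragged lengths allowed).
    Normalizing by the total token count is an assumption of this sketch.
    """
    total = 0.0
    count = 0
    for resp_ratios, resp_rewards, resp_mask in zip(ratios, rewards, mask):
        for rho, r, m in zip(resp_ratios, resp_rewards, resp_mask):
            total += m * rho * r  # masked-out tokens contribute nothing
            count += 1
    return total / count if count else 0.0


# Two sampled responses with 2 and 3 tokens (toy numbers).
ratios = [[1.0, 1.0], [1.0, 0.5, 2.0]]
rewards = [[0.5, -1.0], [0.0, -2.0, 0.5]]
mask = [[1, 0], [1, 1, 1]]

objective = masked_objective(ratios, rewards, mask)
```

In this toy batch the second token of the first response is masked out, so its negative reward never reaches the update.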
Three key mechanisms constitute REOPOLD:
2.1 Mixture-Based Reward Clipping
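The raw token reward $r_{i,t} = \log \pi_T(y_{i,t} \mid \cdot) - \log \pi_\theta(y_{i,t} \mid \cdot)$ diverges to $-\infty$ as $\pi_T(y_{i,t} \mid \cdot) \to 0$, creating outliers with extreme negative values. REOPOLD applies a lower floor $r_{\min}$, derived via Jensen's inequality on a mixture of the teacher and student policies, that bounds the reward from below.
The clipped reward is defined as
$$\tilde{r}_{i,t} = \max\left(r_{i,t},\, r_{\min}\right)$$
ensuring only extremely negative rewards are suppressed while others remain unmodified.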
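The effect of a lower floor on heavy-tailed rewards can be sketched in a few lines. The floor value of -2.0 is a hypothetical constant chosen for this example; the source derives its floor from a mixture policy, and that derivation is not reproduced here.

```python
def clip_rewards(rewards, floor):
    """Raise every reward below `floor` up to `floor`; leave the rest untouched."""
    return [max(r, floor) for r in rewards]


# One extreme outlier (-25.0), as happens when the teacher assigns
# near-zero probability to a token the student sampled.
raw = [0.3, -0.1, -25.0, 0.0]
clipped = clip_rewards(raw, floor=-2.0)  # floor value is illustrative only
```

Only the outlier changes; rewards at or above the floor pass through as they are.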
2.2 Entropy-Based Token-Level Dynamic Sampling
Empirically, most tokens possess low entropy under the student and yield near-zero reward signals, while high-entropy tokens dominate policy divergence. REOPOLD dynamically constructs a mask $M_{i,t}$ at each refinement step: given the entropy $H_{i,t}$ of the student's next-token distribution and a threshold $\tau$ set to a chosen percentile of the token entropies, only tokens with $H_{i,t} \ge \tau$ receive gradient updates. This approach focuses learning on regions of greatest uncertainty.
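A percentile-based entropy mask of this kind might look as follows. This is a sketch under stated assumptions: the nearest-rank percentile rule, the helper names, and the 80th-percentile setting are illustrative choices, not values taken from the source.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_mask(entropies, percentile):
    """Return (mask, threshold): mask is 1 where entropy >= the threshold.

    The threshold is the nearest-rank percentile of the given entropies.
    """
    ranked = sorted(entropies)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    threshold = ranked[rank - 1]
    mask = [1 if h >= threshold else 0 for h in entropies]
    return mask, threshold


# Five toy next-token distributions over a 4-word vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [1.00, 0.00, 0.00, 0.00],  # deterministic
]
entropies = [token_entropy(d) for d in dists]
mask, tau = entropy_mask(entropies, percentile=80)
```

With these inputs, only the two most uncertain positions (the uniform distribution and the 0.4/0.3/0.2/0.1 distribution) remain eligible for updates.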
2.3 Unified Exploration-to-Refinement Multi-Stage Training
REOPOLD training is divided into two sequential regimes:
- Phase I (Exploration, early training steps): The mask disables tokens with extremely negative rewards ($M_{i,t} = 0$ iff the raw reward $r_{i,t}$ falls below the clipping floor). No entropy mask is applied; broad, positive-only supervision is enforced, akin to supervised finetuning.
- Phase II (Refinement, remaining training steps): The mask selects only top-uncertainty tokens using the entropy threshold; updates concentrate on uncertain areas.
The process is realized in Algorithm 1, with iterative policy updates, reward clipping, masking, and gradient computation following these stages.
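The two-phase masking schedule can be sketched as a single function that switches rules at a given step. This is not Algorithm 1 from the paper. The name `build_mask`, the `switch_step` parameter, the strict `<` comparison at the boundary, and all numeric values are assumptions for illustration.

```python
def build_mask(step, switch_step, rewards, entropies,
               reward_floor, entropy_threshold):
    """Choose the token mask according to the training phase.

    Phase I  (step <  switch_step): drop tokens whose raw reward is below the floor.
    Phase II (step >= switch_step): keep only tokens at or above the entropy threshold.
    """
    if step < switch_step:
        return [1 if r >= reward_floor else 0 for r in rewards]
    return [1 if h >= entropy_threshold else 0 for h in entropies]


# Toy per-token statistics for one response of four tokens.
rewards = [0.2, -9.0, -0.5, 0.0]
entropies = [0.1, 1.3, 0.9, 0.05]

explore_mask = build_mask(10, 100, rewards, entropies,
                          reward_floor=-2.0, entropy_threshold=0.8)
refine_mask = build_mask(200, 100, rewards, entropies,
                         reward_floor=-2.0, entropy_threshold=0.8)
```

During exploration only the token with reward -9.0 is excluded, while during refinement the two low-entropy tokens are excluded instead.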
3. Training Stability and Reduction of Negative Transfer
REOPOLD addresses optimization stability and negative transfer through:
- Heavy-tail reward clipping: Prevents destabilizing gradient updates from extreme log-ratio outliers.
- Stop-gradient rewards: Eliminates gradient components arising from variations in the reward term, lowering update variance.
- Token-level sampling: Excludes low-entropy, low-signal tokens from updates, thereby enhancing sample efficiency and convergence.
- Two-phase masking: By preserving broad exploration in early training and limiting punishment for low-probability tokens, REOPOLD avoids entropy collapse and premature loss of policy diversity.
Collectively, these mechanisms yield a stabilized distillation process with consistent and efficient transfer of teacher skills.
4. Empirical Evaluation and Outcomes
REOPOLD was evaluated across a range of reasoning benchmarks with distinct modalities:
- Mathematical reasoning: On datasets such as AIME-24/25, AMC-23, MATH-500, Minerva Math, and Olympiad Bench; distilled from SkyWork-OR1 and DeepSeek-R1 teachers into 1.5B or 7B student models.
- Visual reasoning: Six VQA/comprehension tasks (Geometry3K, MathVerse, MathVision, MathVista, WeMath, Hallusion) with Qwen2.5-VL students distilled from a 32B teacher.
- Agentic visual tool-use: Four benchmarks (PixelReasoner, V-Star, InfoVQA, TallyQA) using the VerlTool framework.
Empirical advantages include:
| Task Type | Baseline | REOPOLD Result | Inference Speedup |
|---|---|---|---|
| Math reasoning | ProRL, DeepScaleR, DeepMath | 6.7×–12× fewer samples | N/A |
| Visual reasoning | 32B teacher | Matched accuracy with 7B student | 3.32× (on H100 + vLLM) |
| Agentic | GRPO, vanilla RKL | Outperforms at 50% training steps | N/A |
Ablation studies demonstrate that stop-gradient rewards lower gradient norm variance and improve validation accuracy; mixture-based reward clipping adds robustness; entropy masking accelerates convergence; and the phased regime prevents entropy collapse and yields higher accuracy (see Figures 1–7 and Table 8 in Ko et al., 11 Mar 2026).
5. Component Analysis and Ablation
Experimental ablations quantify each component's contribution:
- Stop-gradient independently stabilizes learning (improved gradient norm, higher validation accuracy).
- Reward clipping further improves both robustness and convergence, outperforming alternatives such as skew-RKL.
- Entropy-based masking notably boosts convergence speed and ultimate accuracy.
- Exploration-refinement regime avoids entropy collapse observed in vanilla reverse-KL; increases cumulative metrics such as Avg@32 and Pass@32.
- Visual reasoning ablations indicate individual components each provide absolute improvements of roughly 0.4–2%.
6. Limitations and Prospects for Future Research
REOPOLD's hyperparameters are selected by domain-independent heuristics; these may require further optimization per domain. The methodology presupposes on-policy sampling, and potential extensions could explore integration with off-policy data or replay buffers to enhance efficiency. Application of REOPOLD to additional modalities (such as code generation, diffusion models, or continuous control) remains an open research area. Analytical investigation of adaptive or multi-stage exploration-refinement scheduling is a promising direction for future work.
In summary, REOPOLD reconceptualizes on-policy distillation as a structured policy-optimization routine, leveraging targeted and relaxed token-level controls to overcome the rigidity and instability of classical methods. These innovations yield marked improvements in both sample efficiency and inference scalability across mathematical, visual, and agentic reasoning settings, representing a significant advance in knowledge distillation approaches (Ko et al., 11 Mar 2026).