REOPOLD: Efficient Policy Distillation

Updated 14 May 2026

REOPOLD is a training framework that transfers reasoning capabilities from a high-capacity teacher to a smaller student model with improved stability and efficiency.
It employs mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement strategy to mitigate instability and negative transfer.
Empirical results demonstrate up to 12× sample efficiency improvements and a 3.32× inference speedup across mathematical, visual, and agentic reasoning tasks.

REOPOLD (Relaxed On-Policy Distillation) is a training framework for efficiently transferring reasoning capabilities from a high-capacity teacher model to a smaller student model in the context of sequence modeling and decision-making. REOPOLD addresses the instability and negative transfer present in conventional on-policy distillation by transforming the process into a stabilized policy optimization routine that leverages three central mechanisms: mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD demonstrates substantially improved sample efficiency and inference-time scaling on mathematical, visual, and agentic tool-use reasoning tasks, achieving up to 12× sample efficiency improvements and 3.32× inference speedup over prior baselines on standard benchmarks (Ko et al., 11 Mar 2026).

1. Theoretical Foundations of On-Policy Distillation

On-policy distillation, as formalized in REOPOLD, minimizes the reverse Kullback-Leibler (RKL) divergence from the student policy $\pi_\theta$ to the teacher policy $\pi_T$, evaluated on trajectories generated by the student itself:

$D_{\text{KL}}(\pi_\theta \,||\,\pi_T) = -\mathbb{E}_{q \sim Q,\, o\sim \pi_\theta} [\log \pi_T(o|q) - \log \pi_\theta(o|q)].$

This formulation is equivalent to a policy-gradient approach where each generated token $o_{i, t}$ receives as reward the log-likelihood ratio:

$R_{i,t}(\theta) = \log \left( \frac{\pi_T(o_{i,t}\,|\,q,o_{i,<t})}{\pi_\theta(o_{i,t}\,|\,q,o_{i,<t})} \right).$
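As a small illustration of this reward, the sketch below computes per-token log-likelihood ratios from teacher and student log-probabilities. The function name and the probability values are hypothetical and chosen only for demonstration; a real pipeline would read these log-probabilities from the two models.

```python
import math

def token_rewards(teacher_logps, student_logps):
    """Per-token reward R_t = log pi_T(o_t | ...) - log pi_theta(o_t | ...)."""
    return [lt - ls for lt, ls in zip(teacher_logps, student_logps)]

# Hypothetical probabilities that teacher and student assign to three sampled tokens.
teacher_logps = [math.log(0.5), math.log(0.2), math.log(0.01)]
student_logps = [math.log(0.5), math.log(0.4), math.log(0.30)]

rewards = token_rewards(teacher_logps, student_logps)

# Summing token rewards gives the sequence-level log-ratio; its negative is a
# single-sample estimate of the reverse KL for this response.
rkl_estimate = -sum(rewards)

print([round(r, 3) for r in rewards])
print(round(rkl_estimate, 3))
```

The third token, which the teacher considers very unlikely, receives by far the most negative reward; tokens on which the two policies agree receive a reward of zero.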

Generalizing to training with importance-weighted sampling from a previous policy $\pi_{\theta_\text{old}}$, the objective becomes:

$J_\text{RKL}(\theta) = \mathbb{E}_{q,o \sim \pi_{\theta_\text{old}}} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \rho_t(\theta) R_t(\theta) \right], \quad \text{where} \quad \rho_t(\theta) = \frac{\pi_\theta(o_t\,|\,\cdots)}{\pi_{\theta_\text{old}}(o_t\,|\,\cdots)}.$
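The value of this objective for a single sampled response can be sketched as follows. This is a minimal, framework-free evaluation under stated assumptions: `rkl_surrogate` and its arguments are hypothetical names, and the inputs are per-token log-probabilities of the sampled tokens under the current student, the sampling policy, and the teacher.

```python
import math

def rkl_surrogate(new_logps, old_logps, teacher_logps):
    """Length-normalized sum of rho_t * R_t for one sampled response."""
    total = 0.0
    for new_lp, old_lp, teacher_lp in zip(new_logps, old_logps, teacher_logps):
        rho = math.exp(new_lp - old_lp)   # importance ratio pi_theta / pi_theta_old
        reward = teacher_lp - new_lp      # log-likelihood-ratio reward R_t(theta)
        total += rho * reward
    return total / len(new_logps)

# Hypothetical two-token response, evaluated before any update (theta == theta_old).
old_logps = [math.log(0.5), math.log(0.25)]
teacher_logps = [math.log(0.25), math.log(0.25)]
value = rkl_surrogate(old_logps, old_logps, teacher_logps)
print(round(value, 4))
```

When the current policy equals the sampling policy, every ratio is 1 and the value reduces to the mean token reward of the response.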

Applying a "stop-gradient" to $R_t(\theta)$, so that the reward is treated as a constant during differentiation, reduces the variance of the gradient estimator without biasing its expectation, empirically stabilizing training dynamics.
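The unbiasedness claim can be checked in a simplified setting. The derivation below assumes the on-policy, sequence-level case (an assumption made here for illustration: $\rho = 1$ and $R(\theta) = \log \pi_T(o|q) - \log \pi_\theta(o|q)$ for a whole response). Differentiating the expected reward yields two terms:

$\nabla_\theta \mathbb{E}_{o \sim \pi_\theta}[R(\theta)] = \mathbb{E}_{o \sim \pi_\theta}[R(\theta)\, \nabla_\theta \log \pi_\theta(o|q)] + \mathbb{E}_{o \sim \pi_\theta}[\nabla_\theta R(\theta)].$

The second term is the one a stop-gradient discards. Because the teacher does not depend on $\theta$, $\nabla_\theta R(\theta) = -\nabla_\theta \log \pi_\theta(o|q)$, and

$\mathbb{E}_{o \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(o|q)] = \sum_{o} \nabla_\theta \pi_\theta(o|q) = \nabla_\theta 1 = 0.$

The discarded term therefore has zero mean: removing it leaves the expected gradient unchanged and only removes a source of sampling noise.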

2. Core Mechanisms of the REOPOLD Framework

REOPOLD augments vanilla on-policy RKL objectives via selective and tempered token-level reward processing. The principal objective is:

$J_\text{Reopold}(\theta) = \mathbb{E}_{q, o \sim \pi_{\theta_\text{old}}} \left[ \frac{1}{\sum_{i,t} M_{i,t}^{(k)}} \sum_{i,t} \rho_{i,t}(\theta) \hat{R}_{i,t}^\lambda(\theta) M_{i,t}^{(k)} \right],$

with $i$ indexing sampled responses, $t$ indexing tokens, $k$ denoting the current training step, and $M_{i,t}^{(k)}$ indicating whether each token is eligible for a gradient update.
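The masked normalization in this objective can be sketched in a few lines. The function name and the nested-list layout (one inner list per sampled response) are assumptions for illustration, as is the choice to return 0.0 when no token is active; the source does not specify that edge case.

```python
def masked_objective(ratios, rewards, mask):
    """Sum of rho * R_hat * M over all (response, token) pairs, divided by sum of M."""
    numerator = 0.0
    active = 0
    for rho_row, reward_row, mask_row in zip(ratios, rewards, mask):
        for rho, reward, m in zip(rho_row, reward_row, mask_row):
            numerator += rho * reward * m
            active += m
    # Assumption: an all-zero mask yields 0.0 rather than a division error.
    return numerator / active if active else 0.0

# Two hypothetical responses of lengths 2 and 1; the second token is masked out.
ratios = [[1.0, 1.0], [1.0]]
rewards = [[-1.0, -3.0], [0.5]]
mask = [[1, 0], [1]]
objective = masked_objective(ratios, rewards, mask)
print(objective)
```

Dividing by the number of active tokens rather than by the total token count keeps the scale of the objective comparable when the mask is sparse.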

Three key mechanisms constitute REOPOLD:

2.1 Mixture-Based Reward Clipping

The raw token reward $R_{i,t}(\theta)$ diverges to $-\infty$ as $\pi_T(o_{i,t}\,|\,q,o_{i,<t}) \to 0$, creating outliers with extreme negative values. REOPOLD applies a lower floor, derived via Jensen's inequality on a mixture policy and controlled by $\lambda$, that bounds the reward from below.

The clipped reward $\hat{R}_{i,t}^\lambda(\theta)$ takes the floor value wherever the raw reward falls below it and equals $R_{i,t}(\theta)$ otherwise, ensuring only extremely negative rewards are suppressed while others remain unmodified.
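The effect of a lower floor on the rewards can be illustrated with a one-line clip. This is a sketch under an explicit assumption: `floor` is a hypothetical input supplied by the caller, whereas REOPOLD derives its floor from the mixture policy and $\lambda$; the numeric values are illustrative only.

```python
def clip_rewards(rewards, floor):
    """Raise every reward below `floor` up to `floor`; leave the rest unchanged."""
    return [max(r, floor) for r in rewards]

# Hypothetical token rewards; the last one is an extreme negative outlier.
raw = [0.1, -0.7, -9.2]
clipped = clip_rewards(raw, floor=-3.0)  # floor value chosen only for illustration
print(clipped)
```

Only the outlier is altered, so the gradient contribution of ordinary tokens is unaffected while the heavy negative tail is bounded.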

2.2 Entropy-Based Token-Level Dynamic Sampling

Empirically, most tokens possess low entropy under the student and yield near-zero reward signals, while high-entropy tokens dominate policy divergence. REOPOLD dynamically constructs the mask $M_{i,t}^{(k)}$ at each refinement step: given each token's entropy and a threshold set at a chosen percentile of those entropies, only tokens whose entropy exceeds the threshold receive gradient updates. This approach focuses learning on regions of greatest uncertainty.
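A percentile-based entropy mask of this kind might look like the sketch below. Several details are assumptions not taken from the source: the function names, the nearest-rank percentile rule, the inclusive comparison at the threshold, and computing the percentile over whatever list of tokens is passed in.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_mask(entropies, percentile):
    """1 for tokens whose entropy is at or above the given percentile, else 0."""
    ranked = sorted(entropies)
    # Nearest-rank percentile (an assumed convention).
    rank = max(1, math.ceil(percentile * len(ranked) / 100.0))
    threshold = ranked[rank - 1]
    return [1 if h >= threshold else 0 for h in entropies]

# Hypothetical per-token entropies: most tokens are near-deterministic.
entropies = [0.01, 0.02, 1.2, 0.05, 0.9]
mask = entropy_mask(entropies, percentile=80)
print(mask)
```

With these values only the two most uncertain tokens remain active, which is the intended concentration of updates on high-entropy positions.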

2.3 Unified Exploration-to-Refinement Multi-Stage Training

REOPOLD training is divided into two sequential regimes:

Phase I (Exploration, early training steps): The mask disables only tokens with extremely negative rewards. No entropy mask is applied; broad, positive-only supervision is enforced, akin to supervised finetuning.
Phase II (Refinement, later training steps): The mask selects only top-uncertainty tokens using the entropy threshold; updates concentrate on uncertain areas.

The process is realized in Algorithm 1 of the paper (Ko et al., 11 Mar 2026), with iterative policy updates, reward clipping, masking, and gradient computation following these stages.
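The switch between the two masking rules can be sketched as a single function of the training step. This is not the paper's Algorithm 1; it is an illustrative reading of the two phases, and every name here (`build_mask`, `switch_step`, `reward_floor`, `entropy_threshold`) as well as the inclusive comparisons are assumptions.

```python
def build_mask(step, switch_step, rewards, entropies, reward_floor, entropy_threshold):
    """Return a 0/1 token mask for the current training step."""
    if step <= switch_step:
        # Phase I (exploration): drop only tokens with extremely negative reward.
        return [1 if r >= reward_floor else 0 for r in rewards]
    # Phase II (refinement): keep only high-entropy tokens.
    return [1 if h >= entropy_threshold else 0 for h in entropies]

# Hypothetical per-token statistics for one short response.
rewards = [0.2, -8.0, -0.5]
entropies = [0.01, 1.1, 0.7]

early = build_mask(10, 100, rewards, entropies, reward_floor=-3.0, entropy_threshold=0.5)
late = build_mask(200, 100, rewards, entropies, reward_floor=-3.0, entropy_threshold=0.5)
print(early, late)
```

In this toy case the early mask drops only the outlier token, while the late mask keeps that same token because it is among the most uncertain, showing how the two phases can select different parts of a response.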

3. Training Stability and Reduction of Negative Transfer

REOPOLD addresses optimization stability and negative transfer through:

Heavy-tail reward clipping: Prevents destabilizing gradient updates from extreme log-ratio outliers.
Stop-gradient rewards: Eliminates gradient components arising from variations in $R_{i,t}(\theta)$, lowering update variance.
Token-level sampling: Excludes low-entropy, low-signal tokens from updates, thereby enhancing sample efficiency and convergence.
Two-phase masking: By preserving broad exploration in early training and limiting punishment for low-probability tokens, REOPOLD avoids entropy collapse and premature loss of policy diversity.

Collectively, these mechanisms yield a stabilized distillation process with consistent and efficient transfer of teacher skills.

4. Empirical Evaluation and Outcomes

REOPOLD was evaluated across a range of reasoning benchmarks with distinct modalities:

Mathematical reasoning: On datasets such as AIME-24/25, AMC-23, MATH-500, Minerva Math, and Olympiad Bench; distilled from SkyWork-OR1 and DeepSeek-R1 teachers into 1.5B or 7B student models.
Visual reasoning: Six VQA/comprehension tasks (Geometry3K, MathVerse, MathVision, MathVista, WeMath, Hallusion) with Qwen2.5-VL students distilled from a 32B teacher.
Agentic visual tool-use: Four benchmarks (PixelReasoner, V-Star, InfoVQA, TallyQA) using the VerlTool framework.

Empirical advantages include:

Task Type	Baselines / Reference	REOPOLD Result	Inference Speedup
Math reasoning	ProRL, DeepScaleR, DeepMath	6.7×–12× fewer samples	N/A
Visual reasoning	32B teacher	Matched accuracy with 7B student	3.32× (on H100 + vLLM)
Agentic tool-use	GRPO, vanilla RKL	Outperforms with 50% of the training steps	N/A

Ablation studies demonstrate that stop-gradient rewards lower gradient norm variance and improve validation accuracy; mixture-based reward clipping adds robustness; entropy masking accelerates convergence; and the phased regime prevents entropy collapse and yields higher accuracy (see Figures 1–7 and Table 8 in Ko et al., 11 Mar 2026).

5. Component Analysis and Ablation

Experimental ablations quantify each component's contribution:

Stop-gradient independently stabilizes learning (improved gradient norm, higher validation accuracy).
Reward clipping further improves both robustness and convergence, outperforming alternatives such as skew-RKL.
Entropy-based masking notably boosts convergence speed and ultimate accuracy.
Exploration-refinement regime avoids entropy collapse observed in vanilla reverse-KL; increases cumulative metrics such as Avg@32 and Pass@32.
Visual reasoning ablations indicate individual components each provide absolute improvements of roughly 0.4–2%.
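For readers unfamiliar with the Avg@32 and Pass@32 metrics mentioned above, the sketch below shows one common way such multi-sample metrics are computed. These definitions are an assumption: the source does not state how it computes them, and the correctness flags are made up for illustration (with k = 4 instead of 32 for brevity).

```python
def avg_at_k(correct):
    """Fraction of the k sampled answers that are correct."""
    return sum(correct) / len(correct)

def pass_at_k(correct):
    """1.0 if at least one of the k sampled answers is correct, else 0.0."""
    return 1.0 if any(correct) else 0.0

# Hypothetical correctness flags for k = 4 samples on two problems.
problems = [[True, False, False, True], [False, False, False, False]]

avg_score = sum(avg_at_k(c) for c in problems) / len(problems)
pass_score = sum(pass_at_k(c) for c in problems) / len(problems)
print(avg_score, pass_score)
```

Under these definitions Avg@k reflects the typical quality of a single sample, while Pass@k rewards diversity, which is why a policy that collapses in entropy can hold Avg@k steady yet lose Pass@k.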

6. Limitations and Prospects for Future Research

REOPOLD's hyperparameters, which govern the reward clipping, the entropy threshold, and the switch between phases, are selected by domain-independent heuristics; these may require further optimization per domain. The methodology presupposes on-policy sampling, and potential extensions could explore integration with off-policy data or replay buffers to enhance efficiency. Application of REOPOLD to additional modalities, such as code generation, diffusion models, or continuous control, remains an open research area. Analytical investigation of adaptive or multi-stage exploration-refinement scheduling is a promising direction for future work.

In summary, REOPOLD reconceptualizes on-policy distillation as a structured policy-optimization routine, leveraging targeted and relaxed token-level controls to overcome the rigidity and instability of classical methods. These innovations yield marked improvements in both sample efficiency and inference scalability across mathematical, visual, and agentic reasoning settings, representing a significant advance in knowledge distillation approaches (Ko et al., 11 Mar 2026).

References

Ko et al., Scaling Reasoning Efficiently via Relaxed On-Policy Distillation (11 Mar 2026)
