
REOPOLD: Efficient Policy Distillation

Updated 14 May 2026
  • REOPOLD is a training framework that transfers reasoning capabilities from a high-capacity teacher to a smaller student model with improved stability and efficiency.
  • It employs mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement strategy to mitigate instability and negative transfer.
  • Empirical results demonstrate up to 12× sample efficiency improvements and a 3.32× inference speedup across mathematical, visual, and agentic reasoning tasks.

REOPOLD (Relaxed On-Policy Distillation) is a training framework for efficiently transferring reasoning capabilities from a high-capacity teacher model to a smaller student model in the context of sequence modeling and decision-making. REOPOLD addresses the instability and negative transfer present in conventional on-policy distillation by transforming the process into a stabilized policy optimization routine that leverages three central mechanisms: mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD demonstrates substantially improved sample efficiency and inference-time scaling on mathematical, visual, and agentic tool-use reasoning tasks, achieving up to 12× sample efficiency improvements and 3.32× inference speedup over prior baselines on standard benchmarks (Ko et al., 11 Mar 2026).

1. Theoretical Foundations of On-Policy Distillation

On-policy distillation, as formalized in REOPOLD, minimizes the reverse Kullback-Leibler (RKL) divergence from the student policy $\pi_\theta$ to the teacher policy $\pi_T$, evaluated on trajectories generated by the student itself:

$$D_{\text{KL}}(\pi_\theta \,\|\, \pi_T) = -\mathbb{E}_{q \sim Q,\, o \sim \pi_\theta}\left[\log \pi_T(o \mid q) - \log \pi_\theta(o \mid q)\right].$$

This formulation is equivalent to a policy-gradient approach where each generated token $o_{i,t}$ receives as reward the log-likelihood ratio:

$$R_{i,t}(\theta) = \log\left(\frac{\pi_T(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}\right).$$
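As a small illustration of the reward above, the following sketch computes per-token log-likelihood-ratio rewards and a single-sample estimate of the reverse KL for one student-generated response. The function names and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the REOPOLD implementation.

```python
import math

def token_rewards(teacher_logprobs, student_logprobs):
    """Per-token reward R_t = log pi_T(o_t | ...) - log pi_theta(o_t | ...)."""
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]

def reverse_kl_estimate(teacher_logprobs, student_logprobs):
    """Single-sample estimate of D_KL(pi_theta || pi_T) for one student response.

    A sequence log-probability is the sum of its token log-probabilities,
    so the estimate is minus the sum of the token rewards.
    """
    return -sum(token_rewards(teacher_logprobs, student_logprobs))

# Hypothetical probabilities of three sampled tokens under each model.
teacher_lp = [math.log(0.5), math.log(0.1), math.log(0.9)]
student_lp = [math.log(0.6), math.log(0.4), math.log(0.9)]

rewards = token_rewards(teacher_lp, student_lp)
kl_sample = reverse_kl_estimate(teacher_lp, student_lp)
```

The second token, which the teacher considers four times less likely than the student does, receives the most negative reward, while the token on which both models agree receives zero reward.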

Generalizing to training with importance-weighted sampling from a previous policy $\pi_{\theta_\text{old}}$, the objective becomes:

$$J_\text{RKL}(\theta) = \mathbb{E}_{q,\, o \sim \pi_{\theta_\text{old}}}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} \rho_t(\theta) R_t(\theta)\right], \quad \text{where} \quad \rho_t(\theta) = \frac{\pi_\theta(o_t \mid \cdots)}{\pi_{\theta_\text{old}}(o_t \mid \cdots)}.$$

Using a "stop-gradient" on $R_t(\theta)$ reduces the variance of the gradient estimator without biasing its expectation, empirically stabilizing training dynamics.
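The unbiasedness claim can be checked with the standard zero-mean property of the score function. The derivation below is a sketch under two assumptions not spelled out in the text: the prefix is held fixed, and $\pi_{\theta_\text{old}}$ assigns nonzero probability wherever $\pi_\theta$ does.

```latex
\begin{aligned}
\nabla_\theta\big[\rho_t(\theta)\,R_t(\theta)\big]
  &= R_t(\theta)\,\nabla_\theta \rho_t(\theta)
   + \rho_t(\theta)\,\nabla_\theta R_t(\theta), \\
\nabla_\theta R_t(\theta)
  &= -\nabla_\theta \log \pi_\theta(o_t \mid \cdots), \\
\mathbb{E}_{o_t \sim \pi_{\theta_\text{old}}}\big[\rho_t(\theta)\,\nabla_\theta R_t(\theta)\big]
  &= -\sum_{o_t} \pi_\theta(o_t \mid \cdots)\,\nabla_\theta \log \pi_\theta(o_t \mid \cdots)
   = -\nabla_\theta \sum_{o_t} \pi_\theta(o_t \mid \cdots) = 0.
\end{aligned}
```

The term removed by the stop-gradient therefore has zero mean and contributes only variance, while the retained term $R_t(\theta)\,\nabla_\theta \rho_t(\theta)$ is the usual importance-weighted policy-gradient estimator.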

2. Core Mechanisms of the REOPOLD Framework

REOPOLD augments vanilla on-policy RKL objectives via selective and tempered token-level reward processing. The principal objective is:

$$J_\text{Reopold}(\theta) = \mathbb{E}_{q,\, o \sim \pi_{\theta_\text{old}}}\left[\frac{1}{\sum_{i,t} M_{i,t}^{(k)}} \sum_{i,t} \rho_{i,t}(\theta)\, \hat{R}_{i,t}^\lambda(\theta)\, M_{i,t}^{(k)}\right],$$

with $i$ indexing sampled responses, $t$ tokens, $k$ the current training step, and $M_{i,t}^{(k)}$ indicating eligibility for gradient update at each token.
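To make the normalization in this objective concrete, here is a minimal sketch that evaluates the masked, importance-weighted average for one batch of responses. The function name, the nested-list layout, and the handling of an all-zero mask are assumptions made for illustration; in real training the ratios would carry gradients and the rewards would be treated as constants.

```python
def reopold_surrogate(ratios, clipped_rewards, mask):
    """Masked, importance-weighted average over all (response, token) pairs.

    ratios[i][t]          -- importance ratio rho_{i,t}(theta)
    clipped_rewards[i][t] -- clipped reward, treated as a constant (stop-gradient)
    mask[i][t]            -- 1 if the token is eligible for an update, else 0
    """
    total, count = 0.0, 0
    for rho_i, r_i, m_i in zip(ratios, clipped_rewards, mask):
        for rho, r, m in zip(rho_i, r_i, m_i):
            total += rho * r * m
            count += m
    # Returning 0.0 for an all-zero mask is an assumption; the source does not cover this case.
    return total / count if count > 0 else 0.0

# Two hypothetical responses with two tokens and one token respectively.
ratios = [[1.0, 1.1], [0.9]]
rewards = [[-0.2, -1.0], [0.5]]
mask = [[1, 0], [1]]
value = reopold_surrogate(ratios, rewards, mask)
```

Because the sum is divided by the number of eligible tokens rather than by the total token count, masking out a token changes both the numerator and the denominator.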

Three key mechanisms constitute REOPOLD:

2.1 Mixture-Based Reward Clipping

The raw token reward $R_{i,t}$ diverges to $-\infty$ as the teacher probability $\pi_T(o_{i,t} \mid q, o_{i,<t}) \to 0$, creating outliers with extreme negative values. REOPOLD applies a lower floor, derived via Jensen's inequality on a mixture policy, that bounds the reward from below.

The clipped reward $\hat{R}_{i,t}^\lambda$ used in $J_\text{Reopold}$ applies this floor, ensuring only extremely negative rewards are suppressed while others remain unmodified. The explicit forms of the mixture policy, the bound, and the clipped reward are given in Ko et al. (11 Mar 2026).
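The effect of a lower floor on the reward can be illustrated with a generic clip. This is only a sketch: `floor` is a hypothetical stand-in for the mixture-derived bound, whose exact expression is not reproduced here, and `clip_reward` is not a function from the REOPOLD codebase.

```python
def clip_reward(reward, floor):
    """Apply a lower floor: rewards below `floor` are raised to it, others pass through."""
    return max(reward, floor)

# Hypothetical token rewards; the second one is an extreme negative outlier.
raw_rewards = [-0.5, -12.0, 0.3]
clipped = [clip_reward(r, floor=-3.0) for r in raw_rewards]
```

Only the outlier is changed; rewards above the floor keep their original values, which matches the behavior described above.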

2.2 Entropy-Based Token-Level Dynamic Sampling

Empirically, most tokens possess low entropy under the student and yield near-zero reward signals, while high-entropy tokens dominate policy divergence. REOPOLD dynamically constructs the mask $M_{i,t}^{(k)}$ at each refinement step: each token's entropy under the student is compared against a threshold set at a chosen percentile of the token entropies, and only tokens whose entropy reaches the threshold receive gradient updates. This approach focuses learning on regions of greatest uncertainty.
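A percentile-based entropy mask of this kind might look as follows. This is a sketch under stated assumptions: the nearest-rank percentile, the inclusive comparison with the threshold, the `percentile=75` setting, and the toy distributions are all illustrative choices rather than details confirmed by the source.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_mask(entropies, percentile):
    """Return a 0/1 mask keeping tokens whose entropy is at or above the percentile.

    Uses a nearest-rank percentile over the supplied entropies.
    """
    ordered = sorted(entropies)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    threshold = ordered[rank - 1]
    return [1 if h >= threshold else 0 for h in entropies]

# Hypothetical student distributions at four token positions.
dists = [
    [0.97, 0.02, 0.01],    # confident
    [0.4, 0.3, 0.3],       # uncertain
    [0.99, 0.005, 0.005],  # very confident
    [0.5, 0.5, 0.0],       # two-way tie
]
entropies = [token_entropy(d) for d in dists]
mask = entropy_mask(entropies, percentile=75)
```

With these inputs only the two uncertain positions are kept, so gradient updates would concentrate on the tokens where the student is least decided.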

2.3 Unified Exploration-to-Refinement Multi-Stage Training

REOPOLD training is divided into two sequential regimes:

  • Phase I (Exploration, the initial training steps): The mask disables tokens with extremely negative rewards. No entropy mask is applied; broad, positive-only supervision is enforced, akin to supervised finetuning.
  • Phase II (Refinement, the remaining training steps): The mask selects only top-uncertainty tokens using the entropy threshold; updates concentrate on uncertain areas.

The process is realized in Algorithm 1, with iterative policy updates, reward clipping, masking, and gradient computation following these stages.
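The two-phase masking can be sketched as a single function that switches rules at a given step. Everything here is hypothetical scaffolding for illustration: `switch_step`, `reward_floor`, and `entropy_threshold` are assumed parameter names, and whether Phase II also keeps the reward-based filter is not stated in the text, so this sketch applies the entropy rule alone.

```python
def phase_mask(step, switch_step, rewards, entropies, reward_floor, entropy_threshold):
    """Per-token 0/1 mask for one response under a two-phase schedule."""
    if step < switch_step:
        # Phase I (exploration): drop only tokens with extremely negative reward.
        return [1 if r >= reward_floor else 0 for r in rewards]
    # Phase II (refinement): keep only high-entropy tokens.
    return [1 if h >= entropy_threshold else 0 for h in entropies]

# Hypothetical per-token rewards and entropies for a three-token response.
rewards = [-0.1, -8.0, 0.3]
entropies = [0.05, 1.2, 0.9]

early = phase_mask(10, 100, rewards, entropies, reward_floor=-5.0, entropy_threshold=0.5)
late = phase_mask(200, 100, rewards, entropies, reward_floor=-5.0, entropy_threshold=0.5)
```

In the early step the outlier token is excluded and the rest are trained on; in the late step the low-entropy first token is excluded instead.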

3. Training Stability and Reduction of Negative Transfer

REOPOLD addresses optimization stability and negative transfer through:

  • Heavy-tail reward clipping: Prevents destabilizing gradient updates from extreme log-ratio outliers.
  • Stop-gradient rewards: Eliminates gradient components arising from variations in $R_t(\theta)$, lowering update variance.
  • Token-level sampling: Excludes low-entropy, low-signal tokens from updates, thereby enhancing sample efficiency and convergence.
  • Two-phase masking: By preserving broad exploration in early training and limiting punishment for low-probability tokens, REOPOLD avoids entropy collapse and premature loss of policy diversity.

Collectively, these mechanisms yield a stabilized distillation process with consistent and efficient transfer of teacher skills.

4. Empirical Evaluation and Outcomes

REOPOLD was evaluated across a range of reasoning benchmarks with distinct modalities:

  • Mathematical reasoning: On datasets such as AIME-24/25, AMC-23, MATH-500, Minerva Math, and Olympiad Bench; distilled from SkyWork-OR1 and DeepSeek-R1 teachers into 1.5B or 7B student models.
  • Visual reasoning: Six VQA/comprehension tasks (Geometry3K, MathVerse, MathVision, MathVista, WeMath, Hallusion) with Qwen2.5-VL students distilled from a 32B teacher.
  • Agentic visual tool-use: Four benchmarks (PixelReasoner, V-Star, InfoVQA, TallyQA) using the VerlTool framework.

Empirical advantages include:

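| Task Type        | Baseline                   | REOPOLD Sample Efficiency         | Inference Speedup      |
|------------------|----------------------------|-----------------------------------|------------------------|
| Math reasoning   | ProRL, DeepScaleR, DeepMath | 6.7×–12× fewer samples           | N/A                    |
| Visual reasoning | 32B teacher                | Matched accuracy with 7B student  | 3.32× (on H100 + vLLM) |
| Agentic          | GRPO, vanilla RKL          | Outperforms at 50% training steps | N/A                    |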

Ablation studies demonstrate that stop-gradient rewards lower gradient norm variance and improve validation accuracy; mixture-based reward clipping adds robustness across values of $\lambda$; entropy masking accelerates convergence; and the phased regime prevents entropy collapse and yields higher accuracy (see Figures 1–7 and Table 8 in Ko et al., 11 Mar 2026).

5. Component Analysis and Ablation

Experimental ablations quantify each component's contribution:

  • Stop-gradient independently stabilizes learning (improved gradient norm, higher validation accuracy).
  • Reward clipping further improves both robustness and convergence, outperforming alternatives such as skew-RKL.
  • Entropy-based masking notably boosts convergence speed and ultimate accuracy.
  • Exploration-refinement regime avoids entropy collapse observed in vanilla reverse-KL; increases cumulative metrics such as Avg@32 and Pass@32.
  • Visual reasoning ablations indicate individual components each provide absolute improvements of roughly 0.4–2%.

6. Limitations and Prospects for Future Research

REOPOLD's three hyperparameters are selected by domain-independent heuristics; these may require further optimization per domain. The methodology presupposes on-policy sampling, and potential extensions could explore integration with off-policy data or replay buffers to enhance efficiency. Application of REOPOLD to additional modalities, such as code generation, diffusion models, or continuous control, remains an open research area. Analytical investigation of adaptive or multi-stage exploration-refinement scheduling is a promising direction for future work.

In summary, REOPOLD reconceptualizes on-policy distillation as a structured policy-optimization routine, leveraging targeted and relaxed token-level controls to overcome the rigidity and instability of classical methods. These innovations yield marked improvements in both sample efficiency and inference scalability across mathematical, visual, and agentic reasoning settings, representing a significant advance in knowledge distillation approaches (Ko et al., 11 Mar 2026).
