RLSD: Self-Distilled RLVR Framework
- Self-Distilled RLVR (RLSD) is a reinforcement learning framework that integrates token-level self-distillation with verifiable rewards to improve credit assignment in sequence modeling.
- RLSD decouples privileged context from update direction by using dense token-level signals solely to modulate update magnitudes, ensuring leakage-free training.
- Empirical results show RLSD outperforms GRPO with higher accuracy and faster early learning in complex tasks, such as those seen in MathVision benchmarks.
Self-Distilled RLVR (RLSD) is a reinforcement learning framework designed to integrate the dense token-level supervision of on-policy self-distillation (OPSD) with the convergence reliability of reinforcement learning with verifiable rewards (RLVR), thereby achieving accelerated early learning and high long-term stability in sequence modeling tasks. RLSD addresses fundamental limitations of both group-level policy optimization and naive self-distillation—specifically, the credit assignment bottleneck of RLVR and the instability and information leakage exhibited by OPSD.
1. Conceptual Foundation
RLSD arises from the observation that standard RLVR methods, including Group Relative Policy Optimization (GRPO), allocate a single scalar reward or group-based advantage uniformly across all tokens in an output sequence, leading to coarse-grained credit assignment. This approach hinders efficient error localization and slows optimization in long-horizon settings. On the other hand, OPSD employs the same model as both teacher and student, granting the teacher access to privileged information (e.g., reference answers), and optimizing a per-token Kullback–Leibler (KL) divergence between feedback-augmented posteriors and student predictions. OPSD offers rapid initial improvements due to its dense, token-level signal but is susceptible to information leakage and late-stage training collapse: the model may encode spurious dependencies, leading to unstable and unbounded behavior.
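To make the contrast concrete, here is a minimal sketch, not taken from the source, that places a sequence-level advantage copied to every token next to a per-token KL signal of the kind OPSD optimizes. The function names `broadcast_advantage` and `token_kl` and the toy distributions are assumptions chosen for illustration.

```python
import math


def broadcast_advantage(advantage, num_tokens):
    """RLVR/GRPO-style credit: every token receives the same scalar."""
    return [advantage] * num_tokens


def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) over one token's vocabulary distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


# Toy 3-token rollout over a 2-word vocabulary (illustrative numbers only).
teacher = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
student = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]

uniform = broadcast_advantage(1.0, len(student))            # same value at every position
dense = [token_kl(t, s) for t, s in zip(teacher, student)]  # differs by position
```

The uniform signal cannot say which token mattered, while the dense signal is largest where teacher and student disagree most and zero where they agree.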
RLSD strictly separates the roles of environment-derived rewards and privileged context by using the verifiable reward to dictate the direction (reinforcement or penalty) of updates while leveraging token-level self-distillation solely for modulating update magnitudes. This structured decoupling resolves the instability and leakage issues plaguing prior on-policy distillation schemes (Yang et al., 3 Apr 2026).
2. Mathematical Structure and Training Objective
RLSD comprises the following components:
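- Let $x$ denote the input prompt, $r$ the privileged context (e.g., reference solution), $y$ a student-sampled trajectory, and $R$ the environment reward indicating correctness.
- The student policy $\pi_\theta(\cdot \mid x, y_{<t})$ and the teacher policy $\pi_T(\cdot \mid x, r, y_{<t})$ are defined by the model without and with access to $r$, respectively.
- A group-based advantage is computed for each sampled rollout $y_i$ within a prompt group of size $G$, following GRPO:
$$A_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}$$
- At each token $t$ in a rollout, RLSD computes the privileged information shift:
$$\Delta_{i,t} = \operatorname{sg}\left[\log \pi_T(y_{i,t} \mid x, r, y_{i,<t}) - \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})\right]$$
This quantity represents the difference in model confidence with and without privileged information; $\operatorname{sg}[\cdot]$ is the stop-gradient operator, which blocks gradients to prevent information leakage.
- The per-token weight is derived as:
$$w_{i,t} = \exp\left(\operatorname{sign}(A_i)\,\Delta_{i,t}\right)$$
where positive advantages amplify tokens supported by the teacher and negative advantages amplify those disfavored by the teacher.
- The final mixed, clipped token advantage is:
$$\hat{A}_{i,t} = A_i\left[(1-\lambda) + \lambda\,\operatorname{clip}\left(w_{i,t},\, 1-\epsilon_w,\, 1+\epsilon_w\right)\right]$$
with $\lambda$ the mixing coefficient, decayed on a schedule over training, and $\epsilon_w$ the clip threshold.
- The RLSD policy loss is then:
$$\mathcal{L}_{\mathrm{RLSD}}(\theta) = -\,\mathbb{E}\left[\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\left(\rho_{i,t}\,\hat{A}_{i,t},\ \operatorname{clip}\left(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{i,t}\right)\right]$$
where $\rho_{i,t} = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})$ is the token-wise importance ratio for PPO-style stability.
This construction ensures that privileged information affects only the magnitude and never the direction or support of updates, preventing both overfitting to $r$ and catastrophic forgetting (Yang et al., 3 Apr 2026).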
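The weighting scheme described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the exponential weight `exp(sign(A) * delta)`, the linear mix with coefficient `lam`, the population standard deviation, and the names `group_advantages` and `token_advantages` are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within one prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards the all-equal case
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(advantage, teacher_logps, student_logps, lam, clip_eps):
    """Scale one rollout's advantage per token; the sign never changes.

    teacher_logps: log-probs of the sampled tokens WITH privileged context.
    student_logps: log-probs of the same tokens WITHOUT it.
    Both are plain floats here, so no gradient flows through them; in an
    autodiff framework they would need an explicit stop-gradient.
    """
    sign = 1.0 if advantage >= 0 else -1.0
    out = []
    for lt, ls in zip(teacher_logps, student_logps):
        delta = lt - ls                          # privileged information shift
        w = math.exp(sign * delta)               # assumed weight form
        w = min(max(w, 1.0 - clip_eps), 1.0 + clip_eps)
        out.append(advantage * ((1.0 - lam) + lam * w))
    return out


adv = group_advantages([1.0, 0.0, 0.0, 1.0])
per_token = token_advantages(adv[0], [-0.1, -1.0], [-0.5, -1.0],
                             lam=0.5, clip_eps=0.2)
```

In this toy call the first token, which the teacher supports more strongly than the student, receives a larger share of the positive advantage, while the second token, where both agree, keeps the unmodified group advantage.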
3. Algorithm and Practical Implementation
The RLSD algorithm proceeds as follows:
- For each training iteration and prompt $x$, sample $G$ trajectories $\{y_i\}_{i=1}^{G}$ from the student policy.
- Compute environment rewards $R_i$ for each rollout and derive group-based advantages $A_i$.
- For each token $t$ in each rollout:
  - Execute a teacher forward pass with privileged input $r$.
  - Compute the privileged information shift $\Delta_{i,t}$ and the resulting weight $w_{i,t}$, then apply clipping.
  - Compute token-wise mixed advantages $\hat{A}_{i,t}$.
- Aggregate per-token clipped PPO surrogates using $\hat{A}_{i,t}$ in lieu of standard group advantages.
- Update model parameters by gradient descent. Synchronize teacher parameters with the student every $K$ steps and hold the teacher fixed between updates.
Key implementation choices include the learning rate, batch size, group size $G$, teacher synchronization interval $K$, and global gradient-norm clipping. The privileged context $r$ is used strictly as a stop-gradient signal for magnitude scaling, not for update direction (Yang et al., 3 Apr 2026).
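The last two steps of the loop, the clipped surrogate over token-level advantages and the periodic teacher synchronization, might look like the following sketch. The helper names (`clipped_surrogate`, `rlsd_loss`, `maybe_sync_teacher`), the default `clip_eps=0.2`, and the dictionary stand-in for model parameters are assumptions for illustration; the source does not specify them.

```python
import math


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single token."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


def rlsd_loss(new_logps, old_logps, token_advs, clip_eps=0.2):
    """Negative mean clipped surrogate, using per-token advantages
    in place of a single group-level advantage."""
    terms = []
    for new, old, adv in zip(new_logps, old_logps, token_advs):
        ratio = math.exp(new - old)  # token-wise importance ratio
        terms.append(clipped_surrogate(ratio, adv, clip_eps))
    return -sum(terms) / len(terms)


def maybe_sync_teacher(step, sync_every, student_params, teacher_params):
    """Copy student parameters into the teacher every `sync_every` steps;
    otherwise the teacher is returned unchanged (held fixed)."""
    if step % sync_every == 0:
        return dict(student_params)
    return teacher_params


loss = rlsd_loss([-1.0, -1.0], [-1.0, -1.0], [1.0, 0.5])
teacher = maybe_sync_teacher(10, 5, {"w": 1}, {"w": 0})
```

A real pipeline would compute `new_logps` with an autodiff framework so that the loss is differentiable; the plain-float version here only shows the arithmetic.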
4. Theoretical Guarantees and Analysis
RLSD resolves the instability inherent in self-distillation-based reinforcement learning via the following properties:
- Leakage-Free Training: The privileged context $r$ cannot influence the sign or support of each token's update. By blocking gradients into $r$-dependent quantities and restricting the token weight $w_{i,t}$ to act as a multiplicative magnitude only, RLSD ensures there is no route for the model to encode $r$-specific correlations, sidestepping the information leakage observed in OPSD.
- Stability and Ceiling: As the student and teacher policies align or as the mixing coefficient $\lambda \to 0$, RLSD recovers ordinary GRPO, inheriting its proven convergence guarantees and long-term stability.
- Impossibility Trilemma: Standard OPSD frameworks cannot simultaneously achieve a stable objective, sustained improvement, and leakage-free training under shared parameters. RLSD uniquely satisfies all three desiderata by careful separation of update roles (Yang et al., 3 Apr 2026).
Ablation studies confirm that omitting self-distillation ($\lambda = 0$) reduces RLSD to GRPO, while removing RLVR's directional anchoring causes OPSD-like collapse. The system remains robust to moderate tuning of the clip threshold $\epsilon_w$ and the $\lambda$-schedule.
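The two limiting cases and the sign guarantee can be checked directly. The derivation assumes the mixed advantage takes the form $\hat{A}_{i,t} = A_i\left[(1-\lambda) + \lambda\,\operatorname{clip}(w_{i,t}, 1-\epsilon_w, 1+\epsilon_w)\right]$ with $w_{i,t} = \exp(\operatorname{sign}(A_i)\,\Delta_{i,t})$. That form is a reconstruction consistent with the description in Section 2, not a quotation from the paper.

$$
\begin{aligned}
\lambda = 0 \;&\Rightarrow\; \hat{A}_{i,t} = A_i \quad \text{(plain GRPO)} \\
\pi_T(y_{i,t} \mid x, r, y_{i,<t}) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \;&\Rightarrow\; \Delta_{i,t} = 0,\ w_{i,t} = 1,\ \hat{A}_{i,t} = A_i \\
0 \le \lambda \le 1,\ 0 \le \epsilon_w < 1 \;&\Rightarrow\; (1-\lambda) + \lambda\,\operatorname{clip}(w_{i,t}, 1-\epsilon_w, 1+\epsilon_w) \;\ge\; 1 - \lambda\epsilon_w \;>\; 0
\end{aligned}
$$

Because the bracketed factor is strictly positive, $\operatorname{sign}(\hat{A}_{i,t}) = \operatorname{sign}(A_i)$ for every token: the privileged context can rescale an update but cannot flip it.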
5. Empirical Results and Comparative Benchmarks
On MMFineReason-123K and large-scale multimodal reasoning benchmarks (MMMU, MathVista, MathVision, ZeroBench, WeMath), RLSD outperforms both GRPO and self-distillation baselines:
| Method | Average Accuracy (%) |
|---|---|
| Base LLM | 51.49 |
| GRPO (RLVR) | 53.86 |
| OPSD | 52.49 |
| SDPO | 52.74 |
| GRPO+OPSD | 52.91 |
| RLSD | 56.18 |
Notably, RLSD also improves over GRPO on MathVision, underscoring the importance of token-level credit assignment in complex mathematical contexts. Training curves show that RLSD learns rapidly early on and reaches a higher solution ceiling than GRPO, while policy entropy remains higher throughout, indicating more persistent exploration at non-critical positions (Yang et al., 3 Apr 2026).
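For reference, the average-accuracy gaps implied by the table can be computed directly. The snippet below only restates the table's numbers; the variable names are illustrative, and the gaps are averages across benchmarks, not MathVision-specific figures.

```python
# Average accuracy (%) per method, copied from the table above.
accuracy = {
    "Base LLM": 51.49,
    "GRPO (RLVR)": 53.86,
    "OPSD": 52.49,
    "SDPO": 52.74,
    "GRPO+OPSD": 52.91,
    "RLSD": 56.18,
}

# Gap between RLSD and every other method, in percentage points.
gain_over = {
    name: round(accuracy["RLSD"] - acc, 2)
    for name, acc in accuracy.items()
    if name != "RLSD"
}

for name, gain in gain_over.items():
    print(f"RLSD vs {name}: +{gain:.2f} pp")
```

On these averages RLSD leads GRPO by 2.32 points and the base model by 4.69 points, while every self-distillation baseline falls below GRPO.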
6. Comparative Position Among RLVR and Self-Distillation Methods
RLSD's hybrid credit assignment sharply distinguishes it from both pure RLVR and standard self-distillation: RLVR provides stable but slow group-level learning; self-distillation (OPSD, SDPO) accelerates initial improvement but risks leakage and instability. RLSD achieves a structured synthesis by isolating privileged-context information flow, yielding a method that is both rapidly convergent and stable in the long run.
In comparison, Sample-Routed Policy Optimization (SRPO) (Li et al., 2 Apr 2026) likewise combines the two signals, routing failures to self-distillation and successes to RLVR, but differs in architectural detail: RLSD integrates the teacher signal multiplicatively for all tokens, modulated by reward-aligned advantage, rather than assigning roles at the sample level. Both RLSD and SRPO exceed the performance of their constituent methods and provide empirical evidence for the value of hybrid, context-aware credit assignment.
7. Limitations and Future Prospects
RLSD's most salient limitations lie in its current focus on multimodal reasoning tasks and on-policy RLVR settings. Application of RLSD to text-only or video reasoning, as well as adaptation to value-based or off-policy RL, require further algorithmic development, particularly in the treatment of token-level weighting mechanisms. Full automation of the schedule for the mixing coefficient $\lambda$ and tight integration with richer supervision sources (beyond fixed reference answers) remain open. Preliminary experiments indicate robust performance beyond the reported domains, but comprehensive studies are needed.
RLSD is implementable as a drop-in replacement for GRPO-style RLVR pipelines. It requires no auxiliary networks or human annotations beyond group-level prompt construction and reference answer provision, making it broadly applicable to a wide array of LLM post-training regimes (Yang et al., 3 Apr 2026).