Self-Distilled Policy Gradient (SDPG)
- SDPG is a reinforcement learning framework designed for long-sequence tasks with sparse rewards, integrating group-relative policy gradients with self-distillation.
- It leverages dense per-token supervision and a reference-policy KL regularizer to stabilize training and address coarse credit assignment.
- Empirical results show SDPG achieves faster convergence, stable entropy, and improved sample efficiency over conventional sparse-reward methods.
Self-Distilled Policy Gradient (SDPG) is an on-policy reinforcement learning (RL) framework designed for long-sequence generation tasks with sparse, sequence-level rewards, such as mathematical reasoning. SDPG fuses group-relative outcome-driven policy gradients with exact full-vocabulary on-policy self-distillation and reference-policy regularization. By combining dense, per-token self-supervision with groupwise verifier advantages and a stabilizing policy anchor, SDPG addresses the credit-assignment and instability problems typical of sparse-reward environments. Empirically, it improves stability and sample efficiency over reinforcement learning with verifiable rewards (RLVR) and previous self-distillation baselines (Liu et al., 2 Jun 2026).
1. Motivation and Problem Statement
Sparse-reward RL in long-sequence domains (e.g., code or mathematics) is ill-posed under standard policy-gradient methods due to two major challenges:
- Coarse Credit Assignment: With only a sequence-level binary verifier reward $r \in \{0, 1\}$, standard approaches such as Group-Relative Policy Optimization (GRPO) compute a group-normalized advantage $\hat{A}$, which is then broadcast to every token in a trajectory. The resulting token-level supervision is coarse and noisy.
- Training Instability from Negative Advantages: Early training is dominated by incorrect sequences, yielding many negative group-relative advantages. PPO-style clipping then over-penalizes the policy, leading to slow convergence and unstable updates.
On-policy self-distillation, in which one model acts as both a “student” (conditioned only on the input $x$) and a “teacher” (conditioned additionally on privileged context $c$, e.g., reference solutions), provides an auxiliary, dense, tokenwise supervisory signal. However, naive use may reinforce locally plausible yet globally invalid rollouts or cause entropy (mode) collapse.
SDPG unifies three loss components: (a) group-relative outcome-reward policy gradients, (b) a gated per-token self-distillation loss applied only to advantageous trajectories, and (c) a reference-policy Kullback-Leibler (KL) regularizer to anchor policy drift.
2. Mathematical Formulation
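SDPG minimizes a composite loss:
$$\mathcal{L}_{\mathrm{SDPG}}(\theta) = \mathcal{L}_{\mathrm{out}}(\theta) + \lambda_t \, \mathcal{L}_{\mathrm{OPD}}(\theta) + \beta \, \mathcal{L}_{\mathrm{KL}}(\theta)$$
where:
- $\mathcal{L}_{\mathrm{out}}$: Outcome-reward group-relative loss.
- $\mathcal{L}_{\mathrm{OPD}}$: Positive-advantage-gated, full-vocabulary, on-policy self-distillation (OPD) loss.
- $\mathcal{L}_{\mathrm{KL}}$: KL regularization anchoring the policy to a reference $\pi_{\mathrm{ref}}$.
- $\lambda_t$: A schedule controlling self-distillation strength, with warm-up and late-training decay.
- $\beta$: Fixed KL regularization weight.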
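To make the warm-up and late-training decay concrete, the following is a minimal sketch of one possible schedule for the self-distillation weight. The piecewise-linear shape and the names `distill_weight`, `lam_max`, `warmup_steps`, `decay_start`, and `decay_steps` are illustrative assumptions; the source does not specify the functional form here.

```python
def distill_weight(step: int, lam_max: float, warmup_steps: int,
                   decay_start: int, decay_steps: int) -> float:
    """Hypothetical piecewise-linear schedule (an assumption, not the paper's).

    Ramps linearly from 0 to lam_max over `warmup_steps`, holds lam_max until
    `decay_start`, then decays linearly to 0 over `decay_steps`.
    """
    if step < warmup_steps:
        # Warm-up: distillation is phased in gradually.
        return lam_max * step / warmup_steps
    if step < decay_start:
        # Plateau: full-strength distillation.
        return lam_max
    if step < decay_start + decay_steps:
        # Late-training decay: reliance on the teacher is phased out.
        frac = (step - decay_start) / decay_steps
        return lam_max * (1.0 - frac)
    return 0.0


# Example: inspect the weight at a few steps of a toy 100-step run.
for s in (0, 5, 10, 60, 70):
    print(s, distill_weight(s, lam_max=1.0, warmup_steps=10,
                            decay_start=50, decay_steps=20))
```

Once the weight reaches zero, only the outcome and reference-KL terms remain, so the student no longer depends on privileged context.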
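The outcome-reward component applies a normalized group-relative advantage
$$\hat{A}_i = \frac{r_i - \operatorname{mean}\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{r_j\}_{j=1}^{G}\right)}$$
with gating $g_i = \mathbb{1}[\hat{A}_i > 0]$ to permit distillation only on successful trajectories. For each sequence position $t$, the student distribution $\pi_\theta(\cdot \mid x, y_{i,<t})$ is contrasted with the teacher $\pi_\theta(\cdot \mid x, c, y_{i,<t})$ via the reverse KL: $$D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x, y_{i,<t}) \,\|\, \mathrm{SG}\left[\pi_\theta(\cdot \mid x, c, y_{i,<t})\right]\right)$$ where SG denotes “stop-gradient”.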
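The advantage and gate can be illustrated with a small worked example. This is a sketch under stated assumptions: the function name `group_advantages` is hypothetical, the population standard deviation is used, and a small `eps` is added to the denominator so that groups with identical rewards do not divide by zero. None of these choices is confirmed by the source.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages and positive-advantage gates for one prompt.

    `rewards` holds the verifier rewards of the rollouts sampled for a single
    input. `eps` is an assumed numerical safeguard, not part of the formula.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is equally plausible
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Gate is 1 only for rollouts that beat the group average.
    gates = [1 if a > 0 else 0 for a in advantages]
    return advantages, gates


adv, gate = group_advantages([1, 0, 0, 1])
print(adv)   # roughly [+1, -1, -1, +1]
print(gate)  # correct rollouts are the only ones eligible for distillation
```

When every rollout in a group receives the same reward, all advantages are zero and every gate is closed, so that prompt contributes no distillation signal.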
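The reference-policy KL term may be computed as either a forward or a reverse unnormalized KL (the SDPG-UFKL and SDPG-URKL variants, respectively, in the experiments below), a choice intended to avoid bias during rollout sampling.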
3. Training Algorithm
The SDPG training procedure involves the following steps for each batch:
- Rollout Sampling: For each input $x$, sample $G$ responses under the current policy.
- Advantage Computation: For each rollout, compute the verifier reward, the group mean and standard deviation, the normalized advantage, and the positive-advantage gate.
- Loss Evaluation: For each token, evaluate the negative log-likelihood (“outcome”) loss, the self-distillation KL (only if the gate $g_i = 1$), and the reference KL.
- Aggregate and Update: Compute the total batch loss and update policy parameters.
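The same procedure in pseudocode form:
    for each batch of inputs:
        for each input x:
            sample G responses from the current policy
            score each response with the verifier
            compute the group mean and std, the normalized advantages, and the gates
            for each token of each response:
                accumulate the outcome loss, scaled by the advantage
                accumulate the self-distillation KL, only if the gate is 1
                accumulate the reference KL
        total loss = outcome loss + (distillation weight) * distillation loss + (KL weight) * reference KL
        update the policy parameters on the total loss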
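The loss-evaluation step can be sketched on toy data in plain Python. This is a simplified illustration, not the authors' implementation: it omits PPO-style clipping, uses the reverse KL for the reference term, averages over tokens and rollouts, and only computes the loss value (in an autodiff framework the teacher logits would be wrapped in a stop-gradient). The function `sdpg_loss` and the per-token dictionary layout are hypothetical.

```python
import math
from statistics import mean, pstdev


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kl(logp, logq):
    """Full-vocabulary KL(p || q) given log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))


def sdpg_loss(rollouts, rewards, lam, beta, eps=1e-6):
    """Toy composite loss for the rollouts of one prompt.

    rollouts[i] is a list of per-token dicts with keys 'token' (sampled id)
    and 'student', 'teacher', 'ref' (logit lists). Assumed layout.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    total = 0.0
    for tokens, r in zip(rollouts, rewards):
        adv = (r - mu) / (sigma + eps)      # group-relative advantage
        gate = 1.0 if adv > 0 else 0.0      # distill only on positive advantage
        outcome = distill = ref = 0.0
        for tok in tokens:
            s = log_softmax(tok["student"])
            outcome += -adv * s[tok["token"]]                      # weighted NLL
            distill += gate * kl(s, log_softmax(tok["teacher"]))   # reverse KL
            ref += kl(s, log_softmax(tok["ref"]))                  # anchor
        total += (outcome + lam * distill + beta * ref) / len(tokens)
    return total / len(rollouts)


# Two one-token rollouts; the teacher disagrees with the student on the first.
flat, sharp = [0.0, 0.0], [2.0, 0.0]
rollouts = [
    [{"token": 0, "student": flat, "teacher": sharp, "ref": flat}],
    [{"token": 0, "student": flat, "teacher": flat, "ref": flat}],
]
print(sdpg_loss(rollouts, rewards=[1.0, 0.0], lam=1.0, beta=0.1))
```

Swapping the two teacher entries moves the disagreement onto the incorrect rollout, where the gate is closed, and the distillation term then contributes nothing.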
4. Theoretical Properties
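SDPG’s self-distillation term, the full-vocabulary reverse KL between the student and the context-enriched teacher, yields a per-token, locally equivalent, variance-reduced policy gradient. Writing $p_t = \pi_\theta(\cdot \mid x, y_{<t})$ for the student and $q_t = \mathrm{SG}\left[\pi_\theta(\cdot \mid x, c, y_{<t})\right]$ for the teacher, the student-side gradient at token $t$ is:
$$\nabla_\theta D_{\mathrm{KL}}(p_t \,\|\, q_t) = \mathbb{E}_{v \sim p_t}\left[\log\frac{p_t(v)}{q_t(v)} \, \nabla_\theta \log p_t(v)\right] = -\,\mathbb{E}_{v \sim p_t}\left[A_t(v) \, \nabla_\theta \log p_t(v)\right]$$
which matches an on-policy policy gradient with a centered log-ratio advantage: $$A_t(v) = \log\frac{q_t(v)}{p_t(v)} - \mathbb{E}_{v' \sim p_t}\left[\log\frac{q_t(v')}{p_t(v')}\right] = \log\frac{q_t(v)}{p_t(v)} + D_{\mathrm{KL}}(p_t \,\|\, q_t)$$ and zero mean under $p_t$.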
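This equivalence can be checked numerically. The sketch below assumes the student is parameterized directly by softmax logits at a single position, in which case the gradient of the reverse KL with respect to logit $v$ reduces to $-p(v)\,A(v)$, with $A$ the centered log-ratio advantage. It compares that closed form against central finite differences. All function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_probs):
    """KL(student || teacher) over the full vocabulary; teacher is held fixed."""
    p = softmax(student_logits)
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, teacher_probs))


def centered_advantage(student_logits, teacher_probs):
    """A(v) = log(q(v) / p(v)) + KL(p || q), which has zero mean under p."""
    p = softmax(student_logits)
    kl = reverse_kl(student_logits, teacher_probs)
    return [math.log(qv / pv) + kl for pv, qv in zip(p, teacher_probs)]


def analytic_grad(student_logits, teacher_probs):
    """Closed-form gradient of the reverse KL w.r.t. each logit: -p(v) * A(v)."""
    p = softmax(student_logits)
    adv = centered_advantage(student_logits, teacher_probs)
    return [-pv * a for pv, a in zip(p, adv)]


def numeric_grad(student_logits, teacher_probs, h=1e-6):
    """Central finite-difference gradient, used only as a reference."""
    grad = []
    for j in range(len(student_logits)):
        up, dn = list(student_logits), list(student_logits)
        up[j] += h
        dn[j] -= h
        grad.append((reverse_kl(up, teacher_probs)
                     - reverse_kl(dn, teacher_probs)) / (2 * h))
    return grad


logits = [0.3, -1.2, 2.0, 0.5]            # toy student logits
teacher = softmax([1.0, 0.0, 1.5, -0.5])  # toy teacher distribution
print(analytic_grad(logits, teacher))
print(numeric_grad(logits, teacher))
```

The two printed vectors should agree up to finite-difference error. Tokens the teacher favors more than the student receive a positive advantage, so a gradient-descent step on the KL raises their logits.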
Positive-advantage gating ensures that distillation does not reinforce locally plausible but globally invalid sequences. The scheduled decay of $\lambda_t$ phases out reliance on privileged teaching signals, enabling exploration and effective deployment of the student on its own. Light KL anchoring (a small, nonzero $\beta$) is necessary for entropy stability; omitting it leads to response drift or runaway policy entropy.
5. Empirical Results
SDPG was benchmarked on math reasoning tasks (DAPO-Math-17k) using Qwen3-4B and Qwen3-1.7B. Privileged context was generated by Gemini 2.5 Pro (“correct answer + chain of thought”). Baselines included GRPO (standard group-relative PPO), RLSD (reward-reweighted self-distillation), and OPCD (pure on-policy context distillation, 1.7B only).
Key findings:
- Both SDPG-URKL and SDPG-UFKL outperform GRPO and RLSD at both model scales and on all main math benchmarks (AIME 2024, AIME 2025, AMC 23; pass@1, mean@32).
- SDPG converges faster, reaching reward plateaus several hundred steps earlier than the baselines.
- Entropy remains stable throughout SDPG training, while RLSD’s entropy collapses by step 250.
- Response length is moderated in SDPG models; baselines experience verbosity collapse or underproduction.
- Ablation: removing KL anchoring ($\beta = 0$) preserves early convergence but induces output drift; removing self-distillation ($\lambda_t = 0$) eliminates early benchmark gains.
6. Implementation Details
The SDPG experiments were implemented as follows:
- Optimization: AdamW (with learning rate, weight decay, and momentum settings) plus gradient clipping.
- Batching: a global batch of prompts, with multiple rollouts sampled per prompt at a set sampling temperature.
- Precision and Parallelism: FSDP + bfloat16, rollout engine via vLLM, 8 × NVIDIA H100 GPUs.
- Sequence length: capped maximum prompt and response lengths (dynamic batching).
- Self-distillation schedule: a peak distillation weight, with a warm-up phase and a decay phase within the total training steps.
- Verifier: binary sequence-level reward; PPO-style clip thresholds on the policy ratio.
- Reference policy: fixed initialization or checkpoint.
The code is available at https://github.com/lauyikfung/SDPG.
7. Context, Related Approaches, and Significance
SDPG can be viewed as a principled extension of self-distilled policy optimization (SDPO) frameworks (Wang et al., 2 Jun 2026), combining reverse-KL self-distillation (per-token and on-policy) with outcome-driven reinforcement learning and stability-improving anchors. Alternative approaches, such as Physics-Guided Policy Optimization (PGPO), apply information-modulated step-size control to self-distillation, but do not couple outcome-reward policy gradient with verifier-gated self-distillation or utilize reference policy anchoring.
Theoretical advances in SDPG include the identification of local equivalence between the self-distillation KL gradient and a variance-reduced, centered log-ratio advantage policy gradient. Empirically, SDPG demonstrates improved stability, faster convergence, and higher benchmark accuracy in sparse-reward long-sequence tasks compared to RLVR, SDPO, and reward-reweighted self-distillation alternatives (Liu et al., 2 Jun 2026).