
Self-Distilled Policy Gradient (SDPG)

Updated 4 June 2026
  • SDPG is a reinforcement learning framework designed for long-sequence tasks with sparse rewards, integrating group-relative policy gradients with self-distillation.
  • It leverages dense per-token supervision and a reference-policy KL regularizer to stabilize training and address coarse credit assignment.
  • Empirical results show SDPG achieves faster convergence, stable entropy, and improved sample efficiency over conventional sparse-reward methods.

Self-Distilled Policy Gradient (SDPG) is an on-policy reinforcement learning (RL) framework designed for long-sequence generation tasks with sparse, sequence-level rewards, such as mathematical reasoning. SDPG fuses group-relative outcome-driven policy gradients with exact full-vocabulary on-policy self-distillation and reference-policy regularization. By integrating dense, per-token self-supervision with groupwise verifier advantages and a stabilizing policy anchor, SDPG addresses the credit assignment and instability issues that typify sparse-reward environments, and empirically demonstrates improved stability and sample efficiency over RLVR and previous self-distillation baselines (Liu et al., 2 Jun 2026).

1. Motivation and Problem Statement

Sparse-reward RL in long-sequence domains (e.g., code or mathematics) is difficult for standard policy-gradient methods for two main reasons:

  • Coarse Credit Assignment: With only a sequence-level binary verifier reward $R(x, y) \in \{0,1\}$, standard approaches such as Group-Relative Policy Optimization (GRPO) compute a group-normalized advantage $A_\text{out}^{(i)}$ and broadcast it to every token in the trajectory. Token-level supervision is therefore coarse and noisy.
  • Training Instability from Negative Advantages: Early training is dominated by incorrect sequences, yielding many negative group-relative advantages. PPO-style clipping then over-penalizes the policy, leading to slow convergence and unstable updates.
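Both challenges can be seen in a small worked example. The sketch below assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function names and the `eps` stabilizer are illustrative and not taken from the paper.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one prompt's rollout group.

    eps is a hypothetical stabilizer that avoids division by zero when
    all rollouts in the group receive the same reward.
    """
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    """Every token of a rollout receives the same sequence-level advantage."""
    return [advantage] * num_tokens

# One prompt, four rollouts, only the last one verified as correct.
rewards = [0, 0, 0, 1]
advs = group_relative_advantages(rewards)

# Coarse credit assignment: all five tokens of rollout 0 share one value.
token_advs = broadcast_to_tokens(advs[0], num_tokens=5)

num_negative = sum(1 for a in advs if a < 0)
print(advs, num_negative)
```

With a mostly incorrect group, three of the four rollouts receive a negative advantage, and every token inside a rollout is pushed in the same direction regardless of which tokens actually caused the failure.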

On-policy self-distillation, instantiated via a model acting as both a “student” (conditioned only on the input $x$) and a “teacher” (conditioned on privileged context $c$, e.g., reference solutions), provides an auxiliary, dense, tokenwise supervisory signal. However, naive use may reinforce locally plausible yet globally invalid rollouts or cause entropy (mode) collapse.

SDPG unifies three loss components: (a) group-relative outcome-reward policy gradients, (b) a gated per-token self-distillation loss applied only to advantageous trajectories, and (c) a reference-policy Kullback-Leibler (KL) regularizer to anchor policy drift.

2. Mathematical Formulation

SDPG minimizes a composite loss: $L_{\mathrm{SDPG}}(\theta) = L_{\mathrm{out}}(\theta) + \beta(k)\, L_{\mathrm{OPD}^+}(\theta) + \alpha\, L_{\mathrm{KL}}(\theta)$

Where:

  • $L_{\mathrm{out}}$: Outcome-reward group-relative loss.
  • $L_{\mathrm{OPD}^+}$: Positive-advantage-gated, full-vocabulary, on-policy self-distillation (OPD) loss.
  • $L_{\mathrm{KL}}$: KL regularization anchoring the policy to a reference $\pi_\mathrm{ref}$.
  • $\beta(k)$: A schedule over training step $k$ controlling self-distillation strength, with warm-up and late-training decay.
  • $\alpha$: Fixed KL regularization weight.
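To make the weighting concrete, the following sketch combines the three terms with a schedule that warms up, holds, and then decays to zero. The piecewise-linear shape and every default value (`beta_max`, `warmup_steps`, `decay_start`, `decay_steps`) are assumptions for illustration; the source states only that the schedule has warm-up and late-training decay.

```python
def beta_schedule(step, beta_max=1.0, warmup_steps=10,
                  decay_start=100, decay_steps=50):
    """Hypothetical beta(k): linear warm-up, plateau, linear decay to zero."""
    if step < warmup_steps:
        return beta_max * step / warmup_steps
    if step < decay_start:
        return beta_max
    if step < decay_start + decay_steps:
        return beta_max * (1.0 - (step - decay_start) / decay_steps)
    return 0.0

def sdpg_total_loss(l_out, l_opd_plus, l_kl, step, alpha=0.01):
    """Composite loss L_out + beta(k) * L_OPD+ + alpha * L_KL.

    alpha=0.01 is a placeholder value, not the paper's setting.
    """
    return l_out + beta_schedule(step) * l_opd_plus + alpha * l_kl

# Distillation contributes fully mid-training and not at all after decay.
mid_training = sdpg_total_loss(l_out=0.2, l_opd_plus=0.5, l_kl=1.0, step=50)
late_training = sdpg_total_loss(l_out=0.2, l_opd_plus=0.5, l_kl=1.0, step=1000)
print(mid_training, late_training)
```

Once the schedule reaches zero, the objective reduces to the outcome loss plus the reference-policy anchor, so the deployed student no longer depends on the privileged teacher signal.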

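The outcome-reward component applies a normalized group-relative advantage

$$A_\text{out}^{(i)} = \frac{R(x, y^{(i)}) - \mathrm{mean}_j\, R(x, y^{(j)})}{\mathrm{std}_j\, R(x, y^{(j)})}$$

computed over the rollouts $y^{(j)}$ sampled for the same input $x$, with gating $g^{(i)} = \mathbb{1}[A_\text{out}^{(i)} > 0]$ to permit distillation only on successful trajectories. For each sequence position $t$, the student distribution $\pi_\theta(\cdot \mid x, y_{<t})$ is contrasted with the teacher $\pi_\theta(\cdot \mid x, c, y_{<t})$ via the reverse KL: $D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \mathrm{SG}[\pi_\theta(\cdot \mid x, c, y_{<t})]\big)$, where SG denotes “stop-gradient”.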

The reference-policy KL term may be computed as forward or reverse, unnormalized KL to avoid bias during rollout sampling.
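The gated, full-vocabulary distillation term can be sketched as follows. This is an illustration under stated assumptions: averaging the per-token KL over the rollout is one plausible normalization rather than the paper's confirmed choice, and plain Python has no automatic differentiation, so the stop-gradient is represented only by treating the teacher probabilities as constants.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher), summed over the full vocabulary."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(student_probs, teacher_probs) if p > 0.0)

def gated_distillation_loss(student_logits, teacher_logits, advantage):
    """Mean per-token reverse KL for one rollout, zero unless advantage > 0.

    student_logits / teacher_logits: one logit vector per sequence position.
    The teacher side is treated as a constant (stop-gradient).
    """
    if advantage <= 0:
        return 0.0
    kls = [reverse_kl(softmax(s), softmax(t))
           for s, t in zip(student_logits, teacher_logits)]
    return sum(kls) / len(kls)

# Two positions, vocabulary of three tokens (toy values).
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]

loss_success = gated_distillation_loss(student, teacher, advantage=1.7)
loss_failure = gated_distillation_loss(student, teacher, advantage=-0.6)
print(loss_success, loss_failure)
```

Only the first position contributes here, because the student and teacher already agree at the second; the failed rollout contributes nothing, which is the point of the positive-advantage gate.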

3. Training Algorithm

The SDPG training procedure involves the following steps for each batch:

  1. Rollout Sampling: For each input $x$, sample a group of responses $y^{(i)}$ under the current policy.
  2. Advantage Computation: For each rollout, compute the verifier reward, group mean, std, normalized advantage, and positive-advantage gate.
  3. Loss Evaluation: For each token, evaluate the negative log-likelihood (“outcome”) loss, self-distillation KL (if $A_\text{out}^{(i)} > 0$), and reference KL.
  4. Aggregate and Update: Compute the total batch loss and update policy parameters.

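A minimal sketch of how the four steps might fit together is given below; it is not the authors' implementation. It assumes rollouts have already been sampled, uses an unclipped advantage-weighted negative log-likelihood for the outcome term, uses a plain KL(student || reference) for the anchor, and stops at the scalar loss because no optimizer is included. The data layout (`reward`, `tokens`, `sampled`, `student`, `teacher`, `reference`) is hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) over the full vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def group_advantages(rewards, eps=1e-6):
    """Step 2: group mean, std, and normalized advantage (eps is assumed)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def sdpg_batch_loss(groups, beta, alpha):
    """Steps 2-4: advantages, per-token losses, and batch aggregation."""
    total, count = 0.0, 0
    for rollouts in groups:  # one group of rollouts per prompt
        advs = group_advantages([r["reward"] for r in rollouts])
        for rollout, adv in zip(rollouts, advs):
            gate = 1.0 if adv > 0 else 0.0  # positive-advantage gate
            for tok in rollout["tokens"]:
                # Outcome term: advantage-weighted negative log-likelihood.
                l_out = -adv * math.log(tok["student"][tok["sampled"]])
                # Self-distillation: reverse KL to the teacher, gated.
                l_opd = gate * kl(tok["student"], tok["teacher"])
                # Anchor: KL to the fixed reference policy.
                l_kl = kl(tok["student"], tok["reference"])
                total += l_out + beta * l_opd + alpha * l_kl
                count += 1
    return total / count

def make_rollout(reward):
    """Toy rollout: two identical tokens over a two-word vocabulary."""
    tok = {"sampled": 0, "student": [0.5, 0.5],
           "teacher": [0.8, 0.2], "reference": [0.5, 0.5]}
    return {"reward": reward, "tokens": [tok, tok]}

groups = [[make_rollout(1), make_rollout(0), make_rollout(0), make_rollout(0)]]
loss_with = sdpg_batch_loss(groups, beta=1.0, alpha=0.01)
loss_without = sdpg_batch_loss(groups, beta=0.0, alpha=0.01)
print(loss_with, loss_without)
```

In a real training loop the student, teacher, and reference distributions would come from model forward passes, and the returned scalar would be differentiated with respect to the policy parameters in step 4.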

4. Theoretical Properties

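SDPG’s self-distillation term, a full-vocabulary reverse KL between the student $\pi_S = \pi_\theta(\cdot \mid x, y_{<t})$ and the context-enriched teacher $\pi_T = \mathrm{SG}[\pi_\theta(\cdot \mid x, c, y_{<t})]$, yields a per-token, locally equivalent, variance-reduced policy gradient. The student-side gradient at token $t$ is:

$$\nabla_\theta D_{\mathrm{KL}}(\pi_S \,\|\, \pi_T) = -\,\mathbb{E}_{v \sim \pi_S}\big[\tilde{A}_t(v)\, \nabla_\theta \log \pi_\theta(v \mid x, y_{<t})\big]$$

which matches an on-policy policy gradient with a centered log-ratio advantage: $\tilde{A}_t(v) = \log \frac{\pi_T(v)}{\pi_S(v)} - \mathbb{E}_{v' \sim \pi_S}\big[\log \frac{\pi_T(v')}{\pi_S(v')}\big]$, which has zero mean under $\pi_S$.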

Positive-advantage gating ensures that distillation does not reinforce locally plausible but globally invalid sequences. The scheduled decay of $\beta(k)$ phases out reliance on privileged teaching signals, enabling exploration and effective student deployment. Light KL anchoring ($\alpha > 0$) is necessary for entropy stability; omission leads to response drift or runaway policy entropy.
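The equivalence between the reverse-KL gradient and a policy gradient with a centered log-ratio advantage can be checked numerically. The sketch below is an illustration only: it treats the student's logits at a single position as the trainable parameters (standing in for $\theta$), holds the teacher distribution fixed as a stand-in for the stop-gradient, and compares the policy-gradient form against central finite differences.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl_from_logits(logits, teacher):
    """KL(student || teacher) with the student parameterized by its logits."""
    p = softmax(logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, teacher))

def policy_gradient_form(logits, teacher):
    """-E_{v~student}[A(v) * d log p(v) / d logits] with centered advantage."""
    p = softmax(logits)
    log_ratio = [math.log(qi / pi) for pi, qi in zip(p, teacher)]
    baseline = sum(pi * lr for pi, lr in zip(p, log_ratio))
    adv = [lr - baseline for lr in log_ratio]  # centered log-ratio advantage
    n = len(logits)
    grad = [0.0] * n
    for v in range(n):
        for j in range(n):
            dlogp = (1.0 if v == j else 0.0) - p[j]  # d log p(v) / d logit_j
            grad[j] -= p[v] * adv[v] * dlogp
    return grad, adv, p

def finite_difference_grad(logits, teacher, h=1e-6):
    """Central-difference gradient of the reverse KL w.r.t. each logit."""
    grad = []
    for j in range(len(logits)):
        up, down = list(logits), list(logits)
        up[j] += h
        down[j] -= h
        grad.append((reverse_kl_from_logits(up, teacher)
                     - reverse_kl_from_logits(down, teacher)) / (2 * h))
    return grad

student_logits = [0.3, -1.2, 0.8, 0.0]
teacher_probs = softmax([1.5, -0.5, 0.2, -1.0])  # held constant

pg_grad, adv, p = policy_gradient_form(student_logits, teacher_probs)
fd_grad = finite_difference_grad(student_logits, teacher_probs)
max_diff = max(abs(a - b) for a, b in zip(pg_grad, fd_grad))
adv_mean = sum(pi * ai for pi, ai in zip(p, adv))
print(max_diff, adv_mean)
```

The two gradients agree up to finite-difference error, and the advantage averages to zero under the student distribution, which is the source of the variance reduction relative to an uncentered log-ratio.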

5. Empirical Results

SDPG was benchmarked on math reasoning tasks (DAPO-Math-17k) using Qwen3-4B and Qwen3-1.7B. Privileged context was generated by Gemini 2.5 Pro (“correct answer + chain of thought”). Baselines included GRPO (standard group-relative PPO), RLSD (reward-reweighted self-distillation), and OPCD (pure on-policy context distillation, 1.7B only).

Key findings:

  • Both SDPG-URKL and SDPG-UFKL outperform GRPO and RLSD at both model scales and on all main math benchmarks (AIME 2024, AIME 2025, AMC 23; pass@1, mean@32).
  • SDPG converges faster, reaching reward plateaus several hundred steps earlier than the baselines.
  • Entropy is stably maintained ([…]) throughout SDPG training, while RLSD’s entropy collapses by step 250.
  • Response length is moderated in SDPG models; baselines experience verbosity collapse or underproduction.
  • Ablation: removing KL anchoring ($\alpha = 0$) preserves early convergence but induces output drift; removing self-distillation ($\beta = 0$) eliminates early benchmark gains.

6. Implementation Details

The SDPG experiments were implemented as follows:

  • Optimization: AdamW, learning rate […], weight decay […], momentum […], gradient clip […].
  • Batching: Global batch size […] prompts, […] rollouts per prompt, temperature […].
  • Precision and Parallelism: FSDP + bfloat16, rollout engine via vLLM, 8 × NVIDIA H100 GPUs.
  • Sequence length: Max prompt […], max response […] (dynamic batching).
  • Self-distillation schedule: $\beta(k)$ […], warmup […] steps, decay […] steps, total […].
  • Verifier: […]; PPO clip thresholds […].
  • Reference policy: Fixed initialization or checkpoint, […].

The code is available at https://github.com/lauyikfung/SDPG.

SDPG can be viewed as a principled extension of self-distilled policy optimization (SDPO) frameworks (Wang et al., 2 Jun 2026), combining reverse-KL self-distillation (per-token and on-policy) with outcome-driven reinforcement learning and stability-improving anchors. Alternative approaches, such as Physics-Guided Policy Optimization (PGPO), apply information-modulated step-size control to self-distillation, but do not couple outcome-reward policy gradient with verifier-gated self-distillation or utilize reference policy anchoring.

Theoretical advances in SDPG include the identification of local equivalence between the self-distillation KL gradient and a variance-reduced, centered log-ratio advantage policy gradient. Empirically, SDPG demonstrates improved stability, faster convergence, and higher benchmark accuracy in sparse-reward long-sequence tasks compared to RLVR, SDPO, and reward-reweighted self-distillation alternatives (Liu et al., 2 Jun 2026).
