
SD-Zero: Self-Distillation via Binary Rewards

Updated 17 April 2026
  • SD-Zero is a post-training method that leverages binary reward signals to create dense, token-level self-supervision for large language models.
  • It employs a dual-mode architecture where a single model acts as both generator and reviser, using control prompts based on binary verifications.
  • Empirical results demonstrate that SD-Zero delivers higher sample efficiency and improved performance on math and code benchmarks compared to established methods.

Self-Distillation Zero (SD-Zero) is a post-training method for LLMs designed to transform sparse binary rewards into dense token-level self-supervision. The method leverages binary verifiers—functions that accept or reject a solution based solely on its final answer—eliminating the need for gold solution traces, high-quality demonstrations, or external teacher models. SD-Zero innovates by training a single model to alternately fulfill two roles: a Generator, which proposes candidate solutions, and a Reviser, which self-edits these solutions in light of binary reward signals. Empirical evidence demonstrates that SD-Zero achieves higher sample efficiency and superior downstream performance on math and code reasoning benchmarks compared to baselines like Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), all under matched training budgets (He et al., 13 Apr 2026).

1. Formal Problem Framework

SD-Zero operates in a setting with a dataset $\mathcal{D} = \{(x_i, a_i)\}_{i=1}^N$, where each $x$ is a reasoning problem (mathematics or code) and $a$ is its canonical answer. The critical restriction is the absence of gold reasoning traces; only a verifier function provides rewards:

$$r(y, a) \in \{0,1\}, \quad r(y, a) = 1 \iff \text{final answer in } y \text{ matches } a.$$

The parameterized policy $\pi_\theta$ generates a response $y = (y_1, \dots, y_T)$ for input $x$, modeling the distribution

$$\pi_\theta(y \mid x) = \prod_{t=1}^T \pi_\theta(y_t \mid x, y_{<t}).$$

The objective is to maximize the probability of correct solutions using only this binary feedback as supervision.
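The two definitions above can be made concrete with a small sketch. It assumes, purely for illustration, that the final answer is written as `\boxed{...}` and compared by exact string match; the source does not specify the extraction convention, and the function names are hypothetical.

```python
import math
import re


def extract_final_answer(response: str):
    """Return the payload of the last \\boxed{...} in the response, or None.

    The \\boxed convention is an assumption made for this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def binary_reward(response: str, answer: str) -> int:
    """r(y, a): 1 if the final answer in y matches a, else 0."""
    final = extract_final_answer(response)
    return int(final is not None and final == answer.strip())


def sequence_log_prob(token_probs):
    """log pi(y | x) as the sum of per-token log-probabilities.

    token_probs[t] stands for pi(y_t | x, y_<t), supplied by the caller.
    """
    return sum(math.log(p) for p in token_probs)


reward_ok = binary_reward("So the result is \\boxed{42}.", "42")
reward_bad = binary_reward("I think it is 41.", "42")
joint_prob = math.exp(sequence_log_prob([0.5, 0.5]))
```

The verifier only inspects the final answer, so a response with flawed reasoning but a correct boxed value would still receive reward 1 under this sketch.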

2. Generator and Reviser Model Roles

SD-Zero is characterized by two operational model roles:

  • Generator: Given $x$, samples a candidate solution $y_{\mathrm{init}} \sim \pi_\theta(\cdot \mid x)$. The model is an autoregressive, decoder-only transformer (e.g., Qwen3-4B, Olmo-3-7B) with softmax token prediction.
  • Reviser: Accepts $(x, y_{\mathrm{init}}, c)$, where $c$ is a control prompt encoding the binary reward: it instructs the model to “start over” if $r(y_{\mathrm{init}}, a) = 0$ or to “rephrase” if $r(y_{\mathrm{init}}, a) = 1$. The Reviser generates a new trace $y_{\mathrm{rev}}$ by conditioning on the initial response and the reward signal.

The following table summarizes the two roles:

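Role | Input(s) | Output
Generator | $x$ | $y_{\mathrm{init}}$ (candidate solution)
Reviser | $x$, $y_{\mathrm{init}}$, control prompt | $y_{\mathrm{rev}}$ (revised trace)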

Both modes share architecture and parameters, differing only in input composition.
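Since the two modes differ only in how the input is composed, the distinction can be sketched as two prompt builders. The prompt wording, the separator, and the function names below are hypothetical; the source only states that the control prompt says to "start over" after a rejected attempt and to "rephrase" after an accepted one.

```python
# Hypothetical control prompt texts; the exact wording is not given in the source.
START_OVER = "The previous attempt was rejected by the verifier. Start over."
REPHRASE = "The previous attempt was accepted by the verifier. Rephrase it."


def control_prompt(reward: int) -> str:
    """Map the binary reward of the initial attempt to a control prompt."""
    if reward not in (0, 1):
        raise ValueError("reward must be 0 or 1")
    return REPHRASE if reward == 1 else START_OVER


def generator_input(problem: str) -> str:
    """Generator mode: the model sees only the problem."""
    return problem


def reviser_input(problem: str, y_init: str, reward: int) -> str:
    """Reviser mode: problem, initial attempt, and reward-dependent control prompt."""
    return "\n\n".join([problem, y_init, control_prompt(reward)])


failed = reviser_input("Compute 2 + 2.", "The answer is 5.", reward=0)
passed = reviser_input("Compute 2 + 2.", "The answer is 4.", reward=1)
```

Both builders feed the same model with the same parameters, which is what lets the Reviser's behavior later be distilled back into Generator mode.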

3. Training Procedure and Objectives

The SD-Zero algorithm comprises two phases: self-revision training followed by on-policy self-distillation.

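  • Phase 1: Self-Revision Training
    • For each $(x, a) \in \mathcal{D}$, the Generator produces an initial response $y_{\mathrm{init}}$. The Reviser generates a revised response $y_{\mathrm{rev}}$ using the reward-informed control prompt $c$.
    • The revision instance $(x, y_{\mathrm{init}}, c, y_{\mathrm{rev}})$ is retained if $r(y_{\mathrm{rev}}, a) = 1$.
    • The joint supervised loss has two terms:
      • Revision loss: the negative log-likelihood of $y_{\mathrm{rev}}$ given the Reviser input $(x, y_{\mathrm{init}}, c)$.
      • Generation loss (for the chain of initial plus revised response): the negative log-likelihood of the chained response given $x$.
      • Combined objective: the two terms are optimized jointly.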

  • Phase 2: On-Policy Self-Distillation

    • The student is initialized as the model $\pi_\theta$ obtained from Phase 1. The teacher is a copy of the same model run in reviser mode.
    • For each problem $x$: sample a response $y$ from the student, use its reward to condition the teacher, then distill the Reviser's next-token distributions into the Generator by minimizing the token-level KL divergence between the two.
    • Teacher weights are periodically synchronized to those of the improved student, enabling iterative self-evolution.
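The Phase 1 data collection step can be sketched as a filter over generate-then-revise pairs. This is a minimal illustration, assuming a revision is kept only when the verifier accepts the revised response; `generate`, `revise`, and `reward` are hypothetical callables standing in for the model and verifier, and the toy versions below are not from the source.

```python
def collect_revisions(dataset, generate, revise, reward):
    """Build Phase 1 training instances from verifier-accepted revisions."""
    kept = []
    for x, a in dataset:
        y_init = generate(x)               # Generator mode
        r_init = reward(y_init, a)         # binary reward on the initial attempt
        y_rev = revise(x, y_init, r_init)  # Reviser mode, conditioned on the reward
        if reward(y_rev, a) == 1:          # keep only accepted revisions
            kept.append({"x": x, "y_init": y_init, "r_init": r_init, "y_rev": y_rev})
    return kept


# Toy stand-ins so the sketch runs end to end.
def toy_generate(x):
    return "answer: 2"


def toy_revise(x, y_init, r_init):
    # Rephrase (here: repeat) accepted attempts, try a different answer otherwise.
    return y_init if r_init == 1 else "answer: 4"


def toy_reward(y, a):
    return int(y.endswith(a))


dataset = [("1+1", "2"), ("2+2", "4"), ("3+3", "6")]
kept = collect_revisions(dataset, toy_generate, toy_revise, toy_reward)
```

In the toy run, the first problem is kept as a "rephrase" instance, the second as a "start over" instance, and the third is dropped because its revision is still wrong.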

4. Algorithmic Structure and Pseudocode

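The SD-Zero training loop combines the two phases of Section 3. Phase 1 samples initial responses, generates reward-conditioned revisions, and fine-tunes on the retained instances. Phase 2 samples responses from the student, conditions the reviser-mode teacher on their rewards, minimizes the token-level KL divergence between teacher and student, and periodically synchronizes the teacher with the student.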

This framework eliminates the need for external teacher models and high-quality traces, and all data-dependent curricula are instantiations of this two-phase paradigm.
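The control flow of the two phases can be illustrated with a structural skeleton. It is a sketch under stated assumptions, not the authors' pseudocode: the update functions are hypothetical placeholders passed in by the caller, the model is a plain dictionary, and the teacher is assumed to be a frozen copy that is refreshed every `sync_every` epochs.

```python
import copy


def run_sd_zero(model, problems, phase1_update, phase2_update, epochs=2, sync_every=1):
    """Two-phase skeleton: self-revision training, then on-policy self-distillation."""
    # Phase 1: self-revision training on the shared model.
    for x, a in problems:
        phase1_update(model, x, a)

    # Phase 2: the teacher starts as a frozen copy of the Phase 1 model.
    teacher = copy.deepcopy(model)
    for epoch in range(1, epochs + 1):
        for x, a in problems:
            # The student samples, the teacher (reviser mode) provides targets.
            phase2_update(model, teacher, x, a)
        if epoch % sync_every == 0:
            # Periodic synchronization: the teacher adopts the improved student weights.
            teacher = copy.deepcopy(model)
    return model, teacher


# Toy stand-ins that only count updates, so the control flow can be checked.
def toy_phase1(model, x, a):
    model["phase1_updates"] += 1


def toy_phase2(student, teacher, x, a):
    student["phase2_updates"] += 1


model = {"phase1_updates": 0, "phase2_updates": 0}
problems = [("1+1", "2"), ("2+3", "5")]
student, teacher = run_sd_zero(model, problems, toy_phase1, toy_phase2, epochs=3)
```

Keeping the teacher as a separate copy between synchronizations means the distillation target stays fixed within an epoch while the student moves.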

5. Algorithmic Innovations and Analysis

Two key innovations distinguish SD-Zero:

  • Token-level Self-localization: Although only binary reward feedback is observed, the KL-divergence loss over tokens is highly concentrated—incorrect tokens in failed traces accrue the majority of error mass. This provides a powerful credit assignment mechanism which identifies the locus of revision using only reward, seen in empirical “credit-assignment” heatmaps. The effect is transformation of sparse supervision into directed, dense per-token learning.

  • Iterative Self-Evolution: Through periodic teacher-student weight synchronization during Phase 2, the model repeatedly improves its self-revision capacity, creating a bootstrap loop (revision → distillation → stronger revision). Experiments show that one round of teacher synchronization after each epoch yields further accuracy gains (approximately +3 percentage points) and that iterative improvement continues without plateauing.
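The token-level self-localization claim can be illustrated with a toy per-token KL computation. The distributions below are invented for the example, and the KL direction (teacher relative to student) is an assumption, since the source does not state it.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def per_token_kl(teacher_dists, student_dists):
    """One KL value per position in the sampled response."""
    return [kl_divergence(t, s) for t, s in zip(teacher_dists, student_dists)]


def kl_mass_share(kls):
    """Fraction of the total KL carried by each position."""
    total = sum(kls)
    return [k / total if total > 0 else 0.0 for k in kls]


# Toy next-token distributions over a 3-word vocabulary at 3 positions.
# Teacher and student nearly agree at positions 0 and 2 and disagree at position 1.
teacher = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.90, 0.05, 0.05]]
student = [[0.88, 0.06, 0.06], [0.90, 0.05, 0.05], [0.88, 0.06, 0.06]]

kls = per_token_kl(teacher, student)
shares = kl_mass_share(kls)
worst_position = max(range(len(shares)), key=shares.__getitem__)
```

In this toy case nearly all of the loss comes from the single position where the reward-informed teacher disagrees with the student, which is the kind of concentration the credit-assignment heatmaps are described as showing.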

6. Training Efficiency and Hyperparameterization

SD-Zero is explicitly sample-efficient relative to its baselines. For a representative training run:

  • Sample budget: 40k generations in Phase 1 (10k initial, 30k revised), 9k in Phase 2; total 49k, reducing LLM calls by approximately 18% compared to RFT, GRPO, and SDFT (which use approximately 60k generations).

  • Token budget: 224M completion tokens, matching baselines.

  • Hyperparameters (Qwen3-4B):

    • Phase 1: batch size 4, learning rate 5e-6, epochs 3, 5% warmup, weight decay 1e-4, bfloat16, FSDP full-shard.
    • Phase 2: global batch 128, micro-batch 1, top-K distill 64, learning rate 5e-6, 4 GPUs.
    • Generation: temperature 1.0, top-p 1.0, max 8k tokens per sample.
  • Teacher synchronization: In self-evolution, weights are synced every epoch or at specified intervals.
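The budget figures above can be checked with a few lines of arithmetic. The baseline is taken as roughly 60k generations, the value consistent with the stated 18% reduction. The gradient-accumulation figure additionally assumes pure data parallelism with one micro-batch per GPU, which the source does not confirm.

```python
# Sample budget, in generations.
phase1_generations = 10_000 + 30_000   # initial + revised
phase2_generations = 9_000
total_generations = phase1_generations + phase2_generations

baseline_generations = 60_000          # approximate budget of RFT, GRPO, SDFT
reduction = 1 - total_generations / baseline_generations

# Phase 2 batching: global batch 128, micro-batch 1, 4 GPUs.
# Assumption: plain data parallelism, so each optimizer step accumulates
# global_batch / (micro_batch * num_gpus) micro-batches per GPU.
global_batch, micro_batch, num_gpus = 128, 1, 4
grad_accum_steps = global_batch // (micro_batch * num_gpus)

print(f"total generations: {total_generations}")
print(f"reduction vs. baseline: {reduction:.1%}")
print(f"gradient accumulation steps: {grad_accum_steps}")
```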

7. Empirical Results and Ablations

Evaluation across 8 mathematics and code reasoning benchmarks (AIME24/25, HMMT25, AMOBench, OpenR1-Math, MATH, Codeforces, LiveCodeBench) with two model families (Qwen3-4B-Instruct, Olmo-3-7B-Instruct) shows:

  • Phase 1 alone achieves gains of +7.8 percentage points (Qwen) and +9.2 pp (Olmo) over base, outperforming RFT and SFT on human demos.
  • Full SD-Zero yields +10.5/+10.4 pp over base, with additional gains from Phase 2 and substantial reductions in average response length (increased token efficiency).
  • Comparison with RFT, GRPO, SDFT: SD-Zero exceeds each by at least 4.8 pp, despite requiring neither gold demonstrations (unlike SDFT) nor multiple per-question rollouts (unlike GRPO).
  • Pass@8 metric: SD-Zero produces the highest multi-sample success rates on math tasks, indicating improvement beyond superficial answer sharpening.
  • Ablation studies:
    • Removing the revision loss halves correction rates, while omitting the generation loss degrades overall accuracy (from 57.6 to 52.2), demonstrating that the two loss terms are complementary and both necessary.
    • Skipping Phase 1 produces only marginal gains; self-revision ability must be seeded by Phase 1.
    • Data allocation studies suggest that performance is best when Phase 2’s self-distillation receives the largest possible share of samples, after a minimal but effective Phase 1.
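For readers unfamiliar with the Pass@8 metric mentioned above, the following sketch shows the widely used unbiased pass@k estimator. Whether the evaluation here uses this estimator or a direct 8-sample success rate is not stated in the source, so treat the choice as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct.

    n: samples drawn per problem, c: how many of them are correct,
    k: number of attempts allowed. Computed as 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


all_wrong = pass_at_k(n=8, c=0, k=8)
one_right = pass_at_k(n=8, c=1, k=8)
single_try = pass_at_k(n=10, c=1, k=1)
```

A model that merely sharpens its single most likely answer tends to gain on pass@1 without gaining on pass@8, which is why a higher pass@8 is read as broader improvement.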

8. Significance and Implications

SD-Zero demonstrates that an LLM can reliably self-improve under only a binary end-task reward, by internalizing its own self-revision process via dense token-level on-policy distillation. This approach matches or exceeds traditional methods that rely on much richer forms of supervision. The method’s sample efficiency, ability to localize errors to tokens, and iterative self-evolution all contribute to its empirical effectiveness. A plausible implication is that techniques such as SD-Zero may provide a template for scalable, supervision-light post-training in other domains where only coarse reward signals are available (He et al., 13 Apr 2026).

References (1)
