SD-Zero: Self-Distillation via Binary Rewards

Updated 17 April 2026

SD-Zero is a post-training method that leverages binary reward signals to create dense, token-level self-supervision for large language models.
It employs a dual-mode architecture where a single model acts as both generator and reviser, using control prompts based on binary verifications.
Empirical results demonstrate that SD-Zero delivers higher sample efficiency and improved performance on math and code benchmarks compared to established methods.

Self-Distillation Zero (SD-Zero) is a post-training method for LLMs designed to transform sparse binary rewards into dense token-level self-supervision. The method leverages binary verifiers—functions that accept or reject a solution based solely on its final answer—eliminating the need for gold solution traces, high-quality demonstrations, or external teacher models. SD-Zero innovates by training a single model to alternately fulfill two roles: a Generator, which proposes candidate solutions, and a Reviser, which self-edits these solutions in light of binary reward signals. Empirical evidence demonstrates that SD-Zero achieves higher sample efficiency and superior downstream performance on math and code reasoning benchmarks compared to baselines like Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), all under matched training budgets (He et al., 13 Apr 2026).

1. Formal Problem Framework

SD-Zero operates on a dataset $\mathcal{D} = \{(x_i,a_i)\}_{i=1}^N$, where each $x_i$ is a reasoning problem (mathematics or code) and $a_i$ is its canonical answer. The critical restriction is the absence of gold reasoning traces; the only supervision comes from a verifier function that returns a binary reward:

$r(y, a) \in \{0,1\},\quad r(y, a) = 1\iff \text{final answer in } y \text{ matches } a.$

The parameterized policy $\pi_\theta$ generates a response $y=(y_1,\dots,y_T)$ for input $x$, modeling the distribution

$\pi_\theta(y\mid x) = \prod_{t=1}^T \pi_\theta(y_t\mid x, y_{<t}).$

The objective is to maximize the probability of correct solutions, $\max_\theta \mathbb{E}_{(x,a)\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}[r(y,a)]$, with supervision consisting only of this binary feedback.
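
As an illustration of the verifier $r(y, a)$, the following is a minimal sketch that assumes the final answer appears inside the last `\boxed{...}` of the response. That answer format is an assumption (the source does not specify one), and the names `extract_final_answer` and `binary_reward` are hypothetical.

```python
import re
from typing import Optional

# Assumption: the final answer is written as \boxed{...}. The source does not
# specify an answer format, so this pattern is illustrative only.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = _BOXED.findall(response)
    return matches[-1].strip() if matches else None


def binary_reward(response: str, answer: str) -> int:
    """r(y, a): 1 iff the final answer in y matches a, else 0."""
    final = extract_final_answer(response)
    return int(final is not None and final == answer.strip())
```

Only the final answer is inspected, so a response with flawed intermediate reasoning but a matching answer still receives reward 1. This is exactly the sparsity that the rest of the method has to work around.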

2. Generator and Reviser Model Roles

SD-Zero is characterized by two operational model roles:

Generator: Given $x$, samples a candidate solution $y_{\mathrm{init}} \sim \pi_\theta(\cdot\mid x)$. The underlying model is an autoregressive, decoder-only transformer (e.g., Qwen3-4B, Olmo-3-7B) with softmax token prediction.
Reviser: Accepts $(x, y_{\mathrm{init}}, c_r)$, where $c_r$ is a control prompt encoding the binary reward $r = r(y_{\mathrm{init}}, a)$: it instructs the model to “start over” if $r = 0$ or to “rephrase” if $r = 1$. The Reviser generates a new trace $y_{\mathrm{rev}} \sim \pi_\theta(\cdot\mid x, y_{\mathrm{init}}, c_r)$ by conditioning on the initial response and the reward signal.

The following table summarizes the two roles:

Role	Input(s)	Output
Generator	$x$	$y_{\mathrm{init}}$
Reviser	$(x, y_{\mathrm{init}}, c_r)$	$y_{\mathrm{rev}}$

Both modes share architecture and parameters, differing only in input composition.
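
Because the two modes differ only in how the input is composed, the distinction can be sketched as two prompt builders. The prompt wording, the `"\n\n"` separator, and the function names below are hypothetical. The source states only that the control prompt says to "start over" for reward 0 and to "rephrase" for reward 1.

```python
# Hypothetical control-prompt wording: the source only says the prompt tells
# the model to "start over" (reward 0) or to "rephrase" (reward 1).
CONTROL_PROMPTS = {
    0: "The attempt above is incorrect. Start over and solve the problem again.",
    1: "The attempt above is correct. Rephrase the solution.",
}


def generator_input(problem: str) -> str:
    """Generator mode: the model sees only the problem."""
    return problem


def reviser_input(problem: str, initial_response: str, reward: int) -> str:
    """Reviser mode: problem, initial attempt, then the reward-based control prompt."""
    if reward not in CONTROL_PROMPTS:
        raise ValueError("reward must be 0 or 1")
    return "\n\n".join([problem, initial_response, CONTROL_PROMPTS[reward]])
```

The same model weights would consume either string; no separate reviser network is involved.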

3. Training Procedure and Objectives

The SD-Zero algorithm comprises two phases: self-revision training followed by on-policy self-distillation.

Phase 1: Self-Revision Training
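- For each $(x, a) \in \mathcal{D}$, the Generator produces $y_{\mathrm{init}} \sim \pi_\theta(\cdot\mid x)$. The Reviser generates $y_{\mathrm{rev}} \sim \pi_\theta(\cdot\mid x, y_{\mathrm{init}}, c_r)$ using the reward-informed control prompt $c_r$, where $r = r(y_{\mathrm{init}}, a)$.
- The revision instance $(x, y_{\mathrm{init}}, c_r, y_{\mathrm{rev}})$ is retained if $r(y_{\mathrm{rev}}, a) = 1$.
- The joint supervised loss has two terms:
  - Revision loss: $\mathcal{L}_{\mathrm{rev}}(\theta) = -\mathbb{E}\big[\log \pi_\theta(y_{\mathrm{rev}} \mid x, y_{\mathrm{init}}, c_r)\big]$
  - Generation loss (for the chain of initial plus revised response, with $\oplus$ denoting concatenation): $\mathcal{L}_{\mathrm{gen}}(\theta) = -\mathbb{E}\big[\log \pi_\theta(y_{\mathrm{init}} \oplus y_{\mathrm{rev}} \mid x)\big]$
- Combined objective: $\mathcal{L}_{\mathrm{P1}}(\theta) = \mathcal{L}_{\mathrm{rev}}(\theta) + \mathcal{L}_{\mathrm{gen}}(\theta)$.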
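
The Phase 1 data-collection steps above can be sketched as a simple filter loop. This is a minimal illustration, not the authors' implementation: `generate`, `revise`, and `verify` are hypothetical callables standing in for the Generator, the Reviser, and the binary verifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class RevisionInstance:
    x: str            # problem
    y_init: str       # initial response from the Generator
    reward_init: int  # binary reward of y_init, used to build the control prompt
    y_rev: str        # revised response from the Reviser


def collect_revision_data(
    dataset: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    revise: Callable[[str, str, int], str],
    verify: Callable[[str, str], int],
) -> List[RevisionInstance]:
    """Keep only revision instances whose revised answer passes the verifier."""
    kept: List[RevisionInstance] = []
    for x, a in dataset:
        y_init = generate(x)
        r_init = verify(y_init, a)
        y_rev = revise(x, y_init, r_init)
        if verify(y_rev, a) == 1:
            kept.append(RevisionInstance(x, y_init, r_init, y_rev))
    return kept
```

The retained instances would then feed the supervised revision and generation losses; that training step is omitted here.
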
Phase 2: On-Policy Self-Distillation
- The student $\pi_\theta$ is initialized from the Phase 1 model. The teacher $\pi_{\bar\theta}$ is the same Phase 1 model run in reviser mode.
- For each $(x, a) \in \mathcal{D}$: sample $y \sim \pi_\theta(\cdot\mid x)$ from the student, use the reward $r(y, a)$ to condition the teacher, then distill the reviser's next-token distributions into the Generator by minimizing the KL divergence between the teacher and student distributions at each token position.
- Teacher weights are periodically synchronized to those of the improved student, enabling iterative self-evolution.
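
To make the token-level distillation signal concrete, here is a minimal pure-Python sketch of a per-token KL loss. Several choices are assumptions not confirmed by the source: the direction KL(teacher || student), the restriction to the teacher's top-K tokens with renormalization (motivated by the "top-K distill" hyperparameter listed in Section 6), and the representation of logits as plain lists. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def topk_kl(teacher_logits: Sequence[float],
            student_logits: Sequence[float],
            k: int) -> float:
    """KL(teacher || student) over the teacher's top-k tokens, renormalized.

    Assumption: the direction and the top-k renormalization are illustrative.
    """
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)[:k]
    t_lp = log_softmax([teacher_logits[i] for i in order])
    s_lp = log_softmax([student_logits[i] for i in order])
    return sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))


def distill_loss(teacher_seq: Sequence[Sequence[float]],
                 student_seq: Sequence[Sequence[float]],
                 k: int = 64) -> float:
    """Mean per-token KL over one response (0.0 for an empty response)."""
    per_token = [topk_kl(t, s, k) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token) if per_token else 0.0
```

Positions where teacher and student agree contribute zero, so the loss mass falls on the tokens where the reward-conditioned teacher disagrees with the student.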

4. Algorithmic Structure and Pseudocode

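The SD-Zero training loop can be summarized as follows:

Phase 1 (self-revision training):
- For each $(x, a) \in \mathcal{D}$, sample an initial response from the Generator and score it with the binary verifier.
- Build the control prompt from the reward (“start over” for 0, “rephrase” for 1) and sample a revision from the Reviser.
- Keep the revision instance only if the revised answer passes the verifier.
- Fine-tune the model on the retained instances with the combined revision and generation losses.

Phase 2 (on-policy self-distillation):
- Initialize the student from the Phase 1 model; the teacher is the same model run in reviser mode.
- For each problem, sample a response from the student, score it, and condition the teacher on that response and its reward.
- Update the student by minimizing the KL divergence to the teacher's next-token distributions.
- Periodically copy the student weights into the teacher.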

This framework eliminates the need for external teacher models and high-quality traces, and all data-dependent curricula are instantiations of this two-phase paradigm.

5. Algorithmic Innovations and Analysis

Two key innovations distinguish SD-Zero:

Token-level Self-localization: Although only binary reward feedback is observed, the KL-divergence loss over tokens is highly concentrated: incorrect tokens in failed traces accrue the majority of the error mass. This provides a credit-assignment mechanism that identifies where a revision is needed using only the reward, as seen in empirical “credit-assignment” heatmaps. The effect is to transform sparse supervision into directed, dense per-token learning.
Iterative Self-Evolution: Through periodic teacher-student weight synchronization during Phase 2, the model repeatedly improves its self-revision capacity, creating a bootstrap loop (revision $\rightarrow$ distillation $\rightarrow$ stronger revision). Experiments show that one round of teacher synchronization after each epoch yields further accuracy gains (approximately +3 percentage points) and that iterative improvement continues without plateauing.
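
The synchronization schedule behind iterative self-evolution can be sketched as follows. This is a toy illustration under stated assumptions: the model is represented as a plain dictionary of weights, `train_step` is a hypothetical callback that updates the student in place using the frozen teacher, and `sync_every` counts epochs.

```python
import copy
from typing import Callable, Iterable, Tuple


def run_self_evolution(
    student: dict,
    dataset: Iterable[Tuple[str, str]],
    train_step: Callable[[dict, dict, str, str], None],
    num_epochs: int,
    sync_every: int = 1,
) -> Tuple[dict, int]:
    """Train the student against a frozen teacher, syncing the teacher periodically.

    Returns the final teacher and the number of synchronizations performed.
    """
    data = list(dataset)
    teacher = copy.deepcopy(student)  # frozen copy, used in reviser mode
    num_syncs = 0
    for epoch in range(1, num_epochs + 1):
        for x, a in data:
            train_step(student, teacher, x, a)  # updates the student only
        if epoch % sync_every == 0:
            teacher = copy.deepcopy(student)  # teacher adopts improved weights
            num_syncs += 1
    return teacher, num_syncs
```

Between synchronizations the teacher stays fixed, so the student always distills from a stable target. Each sync then raises the quality of that target.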

6. Training Efficiency and Hyperparameterization

SD-Zero is more sample-efficient than its baselines. For a representative training run:

Sample budget: 40k generations in Phase 1 (10k initial, 30k revised) and 9k in Phase 2, for a total of 49k. This reduces LLM calls by approximately 18% compared to RFT, GRPO, and SDFT (which use $\sim$60k generations).
Token budget: 224M completion tokens, matching baselines.
Hyperparameters (Qwen3-4B):
- Phase 1: batch size 4, learning rate 5e-6, epochs 3, 5% warmup, weight decay 1e-4, bfloat16, FSDP full-shard.
- Phase 2: global batch 128, micro-batch 1, top-K distill 64, learning rate 5e-6, 4 GPUs.
- Generation: temperature 1.0, top-p 1.0, max 8k tokens per sample.
Teacher synchronization: In self-evolution, weights are synced every epoch or at specified intervals.
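
The budget figures above can be checked with a few lines of arithmetic. The sketch takes the baseline budget as roughly 60k generations, the value consistent with the stated 18% reduction. The gradient-accumulation calculation assumes plain data parallelism, where global batch = micro-batch × GPUs × accumulation steps; the source lists only the batch sizes and GPU count, so that relationship is an assumption. The helper names are hypothetical.

```python
def sample_budget(phase1_initial: int, phase1_revised: int, phase2: int) -> int:
    """Total number of generations across both phases."""
    return phase1_initial + phase1_revised + phase2


def relative_reduction(ours: int, baseline: int) -> float:
    """Fraction of baseline generations saved."""
    return 1.0 - ours / baseline


def grad_accum_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Accumulation steps needed to reach the global batch (assumed data parallel)."""
    per_step = micro_batch * num_gpus
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by micro_batch * num_gpus")
    return global_batch // per_step


total = sample_budget(10_000, 30_000, 9_000)
print(total)                                              # 49000
print(round(100 * relative_reduction(total, 60_000), 1))  # 18.3
print(grad_accum_steps(128, 1, 4))                        # 32
```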

7. Empirical Results and Ablations

Evaluation across 8 mathematics and code reasoning benchmarks (AIME24/25, HMMT25, AMOBench, OpenR1-Math, MATH, Codeforces, LiveCodeBench) with two model families (Qwen3-4B-Instruct, Olmo-3-7B-Instruct) shows:

Phase 1 alone achieves gains of +7.8 percentage points (Qwen) and +9.2 pp (Olmo) over base, outperforming RFT and SFT on human demos.
Full SD-Zero yields +10.5/+10.4 pp over base, with additional gains from Phase 2 and substantial reductions in average response length (increased token efficiency).
Comparison with RFT, GRPO, SDFT: SD-Zero exceeds each by at least 4.8 pp, despite requiring neither gold demonstrations (unlike SDFT) nor multiple per-question rollouts (unlike GRPO).
Pass@8 metric: SD-Zero produces the highest multi-sample success rates on math tasks, indicating improvement beyond superficial answer sharpening.
Ablation studies:
- Removing the revision loss halves correction rates, while omitting the generation loss degrades overall accuracy (from 57.6 to 52.2), demonstrating that the two terms are complementary.
- Skipping Phase 1 produces only marginal gains; self-revision ability must be seeded by Phase 1.
- Data allocation studies suggest that performance is best when Phase 2’s self-distillation receives the largest possible share of samples, after a minimal but effective Phase 1.
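
For readers unfamiliar with the Pass@8 metric mentioned above, the following sketch shows the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for $n$ samples of which $c$ are correct. The source does not state how Pass@8 was computed, so using this estimator here is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated, c: number of correct samples, k: draw size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With exactly 8 samples per problem, `pass_at_k(8, c, 8)` is 1.0 whenever at least one sample is correct and 0.0 otherwise.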

8. Significance and Implications

SD-Zero demonstrates that an LLM can reliably self-improve under only a binary end-task reward, by internalizing its own self-revision process via dense token-level on-policy distillation. This approach matches or exceeds traditional methods that rely on much richer forms of supervision. The method’s sample efficiency, ability to localize errors to tokens, and iterative self-evolution all contribute to its empirical effectiveness. A plausible implication is that techniques such as SD-Zero may provide a template for scalable, supervision-light post-training in other domains where only coarse reward signals are available (He et al., 13 Apr 2026).

References

He et al. (2026). Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision.
