SD-Zero: Self-Distillation via Binary Rewards
- SD-Zero is a post-training method that leverages binary reward signals to create dense, token-level self-supervision for large language models.
- It employs a dual-mode architecture where a single model acts as both generator and reviser, using control prompts based on binary verifications.
- Empirical results demonstrate that SD-Zero delivers higher sample efficiency and improved performance on math and code benchmarks compared to established methods.
Self-Distillation Zero (SD-Zero) is a post-training method for LLMs that transforms sparse binary rewards into dense token-level self-supervision. It relies on binary verifiers, functions that accept or reject a solution based solely on its final answer, and therefore needs no gold solution traces, high-quality demonstrations, or external teacher models. SD-Zero trains a single model to alternate between two roles: a Generator, which proposes candidate solutions, and a Reviser, which edits these solutions in light of the binary reward signal. Reported results show higher sample efficiency and better downstream performance on math and code reasoning benchmarks than Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), all under matched training budgets (He et al., 13 Apr 2026).
1. Formal Problem Framework
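SD-Zero operates in a setting with a dataset $\mathcal{D} = \{(x_i, a_i)\}_{i=1}^{N}$, where each $x_i$ is a reasoning problem (mathematics or code) and $a_i$ its canonical answer. The critical restriction is the absence of gold reasoning traces; only a verifier function provides rewards:

$$r(y, a) \in \{0, 1\}, \qquad r(y, a) = 1 \text{ if the verifier accepts the final answer of } y \text{ against } a, \text{ and } 0 \text{ otherwise.}$$

The parameterized policy $\pi_\theta$ generates a response $y = (y_1, \dots, y_T)$ for input $x$, modeling the distribution

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t}).$$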
The objective is to maximize the probability of correct solutions, using only this binary feedback as supervision.
2. Generator and Reviser Model Roles
SD-Zero is characterized by two operational model roles:
- Generator: Given a problem $x$, samples a candidate solution $y \sim \pi_\theta(\cdot \mid x)$. This employs an autoregressive, decoder-only transformer (e.g., Qwen3-4B, Olmo-3-7B) with softmax token prediction.
- Reviser: Accepts $(x, y, c_r)$, where $c_r$ is a control prompt encoding the binary reward $r$: it instructs the model to “start over” if $r = 0$ or to “rephrase” if $r = 1$. The Reviser generates a new trace $y' \sim \pi_\theta(\cdot \mid x, y, c_r)$ by conditioning on the initial response and the reward signal.
The following table summarizes the two roles:
| Role | Input(s) | Output |
|---|---|---|
| Generator | $x$ | $y$ |
| Reviser | $(x, y, c_r)$ | $y'$ |
Both modes share architecture and parameters, differing only in input composition.
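Because the two roles differ only in input composition, they can be sketched as two prompt builders feeding one model. The prompt wording, templates, and function names below are hypothetical; the source only states that the control prompt says “start over” for a rejected attempt and “rephrase” for an accepted one.

```python
# Hypothetical control prompt texts; only the start-over / rephrase split is from the source.
START_OVER = "The previous attempt was judged incorrect. Start over and solve the problem again."
REPHRASE = "The previous attempt was judged correct. Rephrase it as a cleaner solution."


def control_prompt(reward: int) -> str:
    """Map the binary reward to the control prompt."""
    if reward not in (0, 1):
        raise ValueError("reward must be 0 or 1")
    return REPHRASE if reward == 1 else START_OVER


def generator_input(problem: str) -> str:
    """Generator mode: the model sees only the problem."""
    return f"Problem:\n{problem}\n\nSolution:\n"


def reviser_input(problem: str, attempt: str, reward: int) -> str:
    """Reviser mode: the same model sees the problem, its own attempt, and the control prompt."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Previous attempt:\n{attempt}\n\n"
        f"{control_prompt(reward)}\n\n"
        f"Revised solution:\n"
    )
```

Both builders return plain strings for the same set of weights, so switching roles requires no architectural change.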
3. Training Procedure and Objectives
The SD-Zero algorithm comprises two phases: self-revision training followed by on-policy self-distillation.
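- Phase 1: Self-Revision Training
  - For each $(x, a) \in \mathcal{D}$, the Generator produces $y \sim \pi_\theta(\cdot \mid x)$. The Reviser generates $y' \sim \pi_\theta(\cdot \mid x, y, c_r)$ using the reward-informed control prompt $c_r$, where $r = r(y, a)$.
  - The revision instance $(x, y, c_r, y')$ is retained if $r(y', a) = 1$.
  - The joint supervised loss has two terms:
    - Revision loss: $\mathcal{L}_{\text{rev}}(\theta) = -\log \pi_\theta(y' \mid x, y, c_r)$
    - Generation loss (for chain of initial plus revised response): $\mathcal{L}_{\text{gen}}(\theta) = -\log \pi_\theta(y \oplus y' \mid x)$, where $\oplus$ denotes concatenation
    - Combined objective: $\mathcal{L}_{\text{Phase 1}}(\theta) = \mathcal{L}_{\text{rev}}(\theta) + \mathcal{L}_{\text{gen}}(\theta)$.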
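The Phase 1 data collection described above can be sketched as a generate, verify, revise, filter loop. The callables `generate`, `revise`, and `verify` are hypothetical stand-ins for the model and the verifier, and the toy data is invented; the sketch shows only the retention rule (keep a revision when the revised response is verified as correct), not the fine-tuning step.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class RevisionExample(NamedTuple):
    problem: str
    attempt: str   # initial Generator response
    reward: int    # verifier reward of the attempt; selects the control prompt
    revision: str  # Reviser response


def collect_phase1(
    dataset: Iterable[Tuple[str, str]],
    generate: Callable[[str], str],
    revise: Callable[[str, str, int], str],
    verify: Callable[[str, str], int],
) -> List[RevisionExample]:
    """Keep only the revision instances whose revised response passes the verifier."""
    kept = []
    for problem, gold in dataset:
        attempt = generate(problem)
        reward = verify(attempt, gold)
        revision = revise(problem, attempt, reward)
        if verify(revision, gold) == 1:
            kept.append(RevisionExample(problem, attempt, reward, revision))
    return kept


# Toy stand-ins (invented for illustration): a lookup-table "model".
_ATTEMPTS = {"2+2": "Final answer: 5", "7*8": "Final answer: 54", "3+3": "Final answer: 6"}
_REVISIONS = {"2+2": "Final answer: 4", "7*8": "Final answer: 54", "3+3": "Final answer: 6"}


def toy_verify(response: str, gold: str) -> int:
    return int(response == f"Final answer: {gold}")


toy_dataset = [("2+2", "4"), ("7*8", "56"), ("3+3", "6")]
kept = collect_phase1(
    toy_dataset,
    generate=lambda p: _ATTEMPTS[p],
    revise=lambda p, y, r: _REVISIONS[p],
    verify=toy_verify,
)
```

In the toy run, the failed revision for `7*8` is discarded, while both a corrected attempt (`2+2`) and a rephrased correct attempt (`3+3`) are retained.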
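- Phase 2: On-Policy Self-Distillation
  - The student is initialized as $\pi_\theta$ from Phase 1. The teacher is a copy $\pi_{\bar\theta}$ of the same model, run in Reviser mode.
  - For each $(x, a) \in \mathcal{D}$: sample $y \sim \pi_\theta(\cdot \mid x)$ from the student, use the reward $r = r(y, a)$ to condition the teacher through the control prompt $c_r$, then distill the Reviser's next-token distributions into the Generator by minimizing the token-level KL divergence between the student distribution $\pi_\theta(\cdot \mid x, y_{<t})$ and the teacher distribution $\pi_{\bar\theta}(\cdot \mid x, y, c_r, y_{<t})$, summed over positions $t$.
  - Teacher weights are periodically synchronized to those of the improved student, enabling iterative self-evolution.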
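The token-level distillation signal in Phase 2 can be illustrated with a small pure-Python sketch that computes one KL value per position from student and teacher logits. The direction of the divergence (student relative to teacher) and the toy logits are assumptions; the source does not state which direction is used.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def per_token_kl(
    student_logits: Sequence[Sequence[float]],
    teacher_logits: Sequence[Sequence[float]],
) -> List[float]:
    """One KL(student || teacher) value per position (direction is an assumption)."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must score the same positions")
    return [
        kl_divergence(softmax(s), softmax(t))
        for s, t in zip(student_logits, teacher_logits)
    ]


def distillation_loss(student_logits, teacher_logits) -> float:
    """Sum of per-token KL values over the sampled response."""
    return sum(per_token_kl(student_logits, teacher_logits))


# Toy logits over a 3-token vocabulary: the models agree at position 0, disagree at position 1.
student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
kls = per_token_kl(student, teacher)
```

In practice the logits would come from two forward passes of the same network, one in Generator mode and one in Reviser mode, over the tokens of the student's sampled response.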
4. Algorithmic Structure and Pseudocode
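The SD-Zero training loop runs the two phases in sequence: Phase 1 collects verified self-revisions and fine-tunes on them, and Phase 2 distills the Reviser into the Generator on-policy, with periodic teacher synchronization.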
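A minimal sketch of how the two phases might be chained is given below. It is not the authors' pseudocode: the update functions are hypothetical callables supplied by the caller, the parameters are a plain dictionary, and the teacher is synchronized once per epoch as one of the schedules the text mentions.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

Params = Dict[str, Any]
Dataset = List[Tuple[str, str]]  # (problem, canonical answer) pairs


def train_sd_zero(
    params: Params,
    dataset: Dataset,
    collect_revisions: Callable[[Params, Dataset], list],
    supervised_update: Callable[[Params, list], Params],
    distill_update: Callable[[Params, Params, str, str], Params],
    num_epochs: int = 2,
) -> Tuple[Params, Params]:
    # Phase 1: build the filtered self-revision set and fine-tune on it.
    revisions = collect_revisions(params, dataset)
    params = supervised_update(params, revisions)

    # Phase 2: the teacher starts as a frozen copy of the Phase 1 model.
    teacher = copy.deepcopy(params)
    for _ in range(num_epochs):
        for problem, gold in dataset:
            # Student samples on-policy; teacher (Reviser mode) provides targets.
            params = distill_update(params, teacher, problem, gold)
        # Synchronize the teacher to the improved student after each epoch.
        teacher = copy.deepcopy(params)
    return params, teacher


# Toy stand-ins that only count work done, to show the control flow.
toy_data = [("2+2", "4"), ("7*8", "56"), ("3+3", "6")]
final, final_teacher = train_sd_zero(
    {"sft_examples": 0, "distill_steps": 0},
    toy_data,
    collect_revisions=lambda p, d: list(d),
    supervised_update=lambda p, revs: {**p, "sft_examples": len(revs)},
    distill_update=lambda p, t, x, a: {**p, "distill_steps": p["distill_steps"] + 1},
)
```

Passing the updates in as callables keeps the skeleton independent of any training framework; a real implementation would replace them with gradient steps.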
This framework eliminates the need for external teacher models and high-quality traces, and all data-dependent curricula are instantiations of this two-phase paradigm.
5. Algorithmic Innovations and Analysis
Two key innovations distinguish SD-Zero:
Token-level Self-localization: Although only binary reward feedback is observed, the token-level KL-divergence loss is highly concentrated: incorrect tokens in failed traces accrue the majority of the error mass. This acts as a credit assignment mechanism that identifies where revision is needed using only the reward, as seen in empirical “credit-assignment” heatmaps. The effect is to transform sparse supervision into directed, dense per-token learning.
Iterative Self-Evolution: Through periodic teacher-student weight synchronization during Phase 2, the model repeatedly improves its self-revision capacity, creating a bootstrap loop (revision → distillation → stronger revision). Experiments show that one round of teacher synchronization after each epoch yields further accuracy gains (approximately +3 percentage points), and that improvement continues across iterations without plateauing.
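One way to quantify the concentration behind token-level self-localization is to measure what share of the total per-token KL falls on the few highest-scoring tokens. The sketch below assumes per-token KL values are already available; the tokens and numbers are invented for illustration and are not results from the source.

```python
from typing import List, Sequence


def kl_mass_share(per_token_kl: Sequence[float], top_k: int) -> float:
    """Fraction of the total KL carried by the top_k highest-KL positions."""
    total = sum(per_token_kl)
    if total == 0:
        return 0.0
    top = sorted(per_token_kl, reverse=True)[:top_k]
    return sum(top) / total


def localize(tokens: Sequence[str], per_token_kl: Sequence[float], top_k: int = 1) -> List[str]:
    """Return the tokens at the top_k highest-KL positions."""
    ranked = sorted(range(len(tokens)), key=lambda i: per_token_kl[i], reverse=True)
    return [tokens[i] for i in ranked[:top_k]]


# Invented example: a failed trace "x = 3 + 5 = 9" where the last token is wrong.
tokens = ["x", "=", "3", "+", "5", "=", "9"]
kl_values = [0.01, 0.02, 0.01, 0.03, 0.02, 0.05, 1.86]

share = kl_mass_share(kl_values, top_k=1)
suspects = localize(tokens, kl_values, top_k=1)
```

A share close to 1 for a small `top_k` would correspond to the sharply peaked heatmaps described above, while a share near `top_k / len(tokens)` would indicate no localization.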
6. Training Efficiency and Hyperparameterization
SD-Zero is sample-efficient relative to its baselines. For a representative training run:
Sample budget: 40k generations in Phase 1 (10k initial, 30k revised), 9k in Phase 2; total 49k, reducing LLM calls by approximately 18% compared to RFT, GRPO, and SDFT (which use 560k generations).
Token budget: 224M completion tokens, matching baselines.
Hyperparameters (Qwen3-4B):
- Teacher synchronization: In self-evolution, weights are synced every epoch or at specified intervals.
7. Empirical Results and Ablations
Evaluation across 8 mathematics and code reasoning benchmarks (AIME24/25, HMMT25, AMOBench, OpenR1-Math, MATH, Codeforces, LiveCodeBench) with two model families (Qwen3-4B-Instruct, Olmo-3-7B-Instruct) shows:
- Phase 1 alone achieves gains of +7.8 percentage points (Qwen) and +9.2 pp (Olmo) over base, outperforming RFT and SFT on human demos.
- Full SD-Zero yields +10.5/+10.4 pp over base, with additional gains from Phase 2 and substantial reductions in average response length (increased token efficiency).
- Comparison with RFT, GRPO, SDFT: SD-Zero exceeds each by at least 4.8 pp, despite requiring neither gold demonstrations (unlike SDFT) nor multiple per-question rollouts (unlike GRPO).
- Pass@8 metric: SD-Zero produces the highest multi-sample success rates on math tasks, indicating improvement beyond superficial answer sharpening.
- Ablation studies:
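- Removing the revision loss $\mathcal{L}_{\text{rev}}$ halves correction rates, while omitting the generation loss $\mathcal{L}_{\text{gen}}$ degrades overall accuracy (from 57.6 to 52.2), demonstrating that the two terms are complementary and both necessary.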
- Skipping Phase 1 produces only marginal gains; self-revision ability must be seeded by Phase 1.
- Data allocation studies suggest that performance is best when Phase 1 receives only the minimum share of samples needed to be effective and the remainder goes to Phase 2 self-distillation.
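For readers unfamiliar with the Pass@8 metric mentioned above, the sketch below implements the widely used unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, for $n$ samples of which $c$ are correct. Whether the source computes Pass@8 with this estimator or by direct sampling of 8 responses is not stated, so treat this as an assumption.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generated samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # which correctly yields a pass probability of 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, `pass_at_k(10, 5, 1)` equals the plain single-sample accuracy of 0.5, and the value rises toward 1 as `k` grows whenever at least one sample is correct.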
8. Significance and Implications
SD-Zero demonstrates that an LLM can reliably self-improve under only a binary end-task reward, by internalizing its own self-revision process via dense token-level on-policy distillation. This approach matches or exceeds traditional methods that rely on much richer forms of supervision. The method’s sample efficiency, ability to localize errors to tokens, and iterative self-evolution all contribute to its empirical effectiveness. A plausible implication is that techniques such as SD-Zero may provide a template for scalable, supervision-light post-training in other domains where only coarse reward signals are available (He et al., 13 Apr 2026).