On-Policy Self-Distillation with Sampled Demos

Updated 26 June 2026

The paper introduces a novel on-policy self-distillation method that leverages sampled correct rollouts to provide dense, token-level supervision.
It demonstrates state-of-the-art performance in tasks such as search-augmented reasoning, code synthesis, and continual learning with notable pass@1 improvements.
SDSD eliminates the need for large external teachers, but at the cost of reduced output diversity, which poses challenges for out-of-distribution generalization.

On-Policy Self-Distillation with Sampled Demonstrations (SDSD) is a family of training approaches for LLMs and sequence generation agents that interpolate between conventional supervised fine-tuning (SFT) and outcome-reward reinforcement learning (RL). SDSD leverages the policy’s own successful behaviour: the model is trained as both student and teacher, with the teacher conditioned on sampled correct rollouts (demonstrations) to densely supervise the student at every decision point. Unlike process-supervision or classical imitation learning, SDSD operates fully on-policy and obviates the need for large external teachers or annotated intermediate steps, instead extracting step-level or token-level credit signals from its own past trajectories. This paradigm has produced state-of-the-art performance in search-augmented reasoning, program synthesis, and continual learning, but exhibits distinct theoretical and practical trade-offs, particularly in output diversity and generalization.

1. Formal Problem Statement and Objective

SDSD addresses on-policy sequence learning tasks where the environment provides only sparse signals, such as pass/fail feedback or scalar rewards, after a complete rollout. Let $\pi_\theta$ be an autoregressive policy model that, given input $x$, generates a trajectory $y = (y_1, \dots, y_T)$, with per-token choices $y_t \in \mathcal V$. Classical RL approaches operate with outcome-level rewards $r(y|x)$ and update $\pi_\theta$ accordingly; this results in weak, trajectory-level credit assignment, especially detrimental in settings requiring compositional reasoning or search decisions.

SDSD augments each training batch with "sampled demonstrations": high-reward (i.e., verified correct) rollouts collected on-policy. These are used as privileged context to construct a teacher distribution $\pi_{\text{T}}(\cdot | x, d)$ for each input $x$ and demonstration $d$. The core learning objective is to minimize the divergence (typically Jensen–Shannon or KL) between the student’s distribution $\pi_\theta(\cdot|x)$ and the teacher’s feedback-informed distribution $\pi_{\text{T}}(\cdot | x, d)$, at selected positions or tokens (e.g., search tokens, code completions, or full trajectories), while maintaining on-policy sampling.

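Mathematically, for a batch of demonstrations $\mathcal{D}(x)$ associated with input $x$, the typical SDSD objective is: $\mathcal{L}_{\text{SDSD}}(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x),\, d \sim \mathcal{D}(x)} \left[ \sum_{t \in \mathcal{S}(y)} D\big( \pi_\theta(\cdot | x, y_{<t}) \,\|\, \operatorname{sg}[\pi_{\text{T}}(\cdot | x, d, y_{<t})] \big) \right]$, where $D$ is the chosen divergence (Jensen–Shannon or KL), $\mathcal{S}(y)$ is the set of selected positions, $\operatorname{sg}[\cdot]$ denotes stop-gradient, and only the student is updated (stop-gradient on teacher). This dense, token-level feedback stands in contrast to trajectory-level RL rewards.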
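As a rough illustration of the per-token divergence between student and teacher, the sketch below computes a masked Jensen–Shannon (or KL) loss from two logit arrays using NumPy. The function names, the array shapes, and the use of a 0/1 position mask are assumptions made for this example, not details taken from the cited papers; in a real training setup the teacher term would be detached from the gradient graph.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl(log_p, log_q):
    """KL(p || q) per position, from log-probabilities of shape (T, V)."""
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def jsd(log_p, log_q):
    """Jensen-Shannon divergence per position (bounded above by log 2)."""
    log_m = np.logaddexp(log_p, log_q) - np.log(2.0)  # log of the mixture
    return 0.5 * kl(log_p, log_m) + 0.5 * kl(log_q, log_m)

def distill_loss(student_logits, teacher_logits, mask, divergence=jsd):
    """Average divergence over the positions selected by `mask`.

    student_logits: (T, V) logits of the student given its own prefix.
    teacher_logits: (T, V) logits of the demonstration-conditioned teacher,
                    treated here as a constant (stop-gradient).
    mask:           length-T sequence of 0/1 flags for supervised positions.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    per_token = divergence(log_s, log_t)
    mask = np.asarray(mask, dtype=float)
    return float((per_token * mask).sum() / max(mask.sum(), 1.0))

rng = np.random.default_rng(0)
student = rng.normal(size=(5, 8))   # 5 positions, toy vocabulary of 8 tokens
teacher = rng.normal(size=(5, 8))
print(distill_loss(student, teacher, mask=[1, 0, 1, 0, 1]))
```

Passing `divergence=kl` switches to a student-to-teacher KL; restricting `mask` to a few positions mirrors supervision at designated tokens, while an all-ones mask covers the whole sequence.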

2. Algorithmic Structure and Mechanisms

The SDSD pipeline comprises several key steps:

On-policy sampling: For each training iteration and input $x$, sample a group of $G$ rollouts $\{y^{(i)}\}_{i=1}^{G}$ from $\pi_\theta(\cdot | x)$. Evaluate each rollout with an environment or verifier to determine correctness.
Demonstration selection: Select successful rollouts $\mathcal{D}(x) = \{y^{(i)} : r(y^{(i)}|x) = 1\}$ (reward $r = 1$) as demonstrations. For each failed rollout (or, more generally, for every rollout), sample a demonstration $d$ from $\mathcal{D}(x)$.
Teacher/student conditioning: The student predicts each token conditioned only on its own prefix. The teacher is conditioned on both the same context and the selected demonstration (or, in feedback-augmented settings, on feedback as well).
Distillation loss computation: At designated positions (e.g., search queries), compute the Jensen–Shannon or KL divergence between student and teacher distributions. For example, SD-Search aligns only at search-query tokens using JSD, while more general frameworks may use per-token KL over the whole sequence.
Joint/auxiliary loss: SDSD loss may be combined with an RL loss such as GRPO, with a small weighting coefficient, providing both dense local credit and coarse trajectory-level updates.
On-policy updating: All rollouts are generated on-policy with the current student; gradients are propagated only through the student.
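The demonstration-selection and joint-loss steps above can be sketched in a few lines of standard-library Python. Everything named here (`select_demonstrations`, `group_normalized_advantages`, `joint_loss`, the default weight `lam=0.1`) is hypothetical and chosen for illustration; the advantage normalization follows the usual group-relative form associated with GRPO, and rollouts are stood in for by short strings.

```python
import random
from statistics import mean, pstdev

def select_demonstrations(rollouts, rewards, rng, only_failed=True):
    """Pair rollouts with a demonstration sampled from the verified-correct ones."""
    successes = [y for y, r in zip(rollouts, rewards) if r > 0]
    if not successes:
        return []  # no verified rollout in this group: nothing to distill from
    pairs = []
    for y, r in zip(rollouts, rewards):
        if only_failed and r > 0:
            continue  # skip rollouts that already succeeded
        pairs.append((y, rng.choice(successes)))
    return pairs

def group_normalized_advantages(rewards, eps=1e-6):
    """Trajectory-level advantages normalized within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def joint_loss(rl_loss, distill_loss, lam=0.1):
    """RL loss plus a small-weight distillation term (lam is an assumed value)."""
    return rl_loss + lam * distill_loss

rollouts = ["rollout-a", "rollout-b", "rollout-c", "rollout-d"]
rewards = [1.0, 0.0, 0.0, 1.0]          # verifier outcome per rollout
pairs = select_demonstrations(rollouts, rewards, random.Random(0))
advantages = group_normalized_advantages(rewards)
print(pairs)
print(advantages)
```

Setting `only_failed=False` pairs every rollout with a demonstration, which matches the "or in general, all" variant; a group with no successful rollout simply contributes no distillation signal in this sketch.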

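A typical loop for code or reasoning tasks (Hübotter et al., 28 Jan 2026; Ma et al., 18 May 2026) is as follows:

    for each training iteration:
        for each input x in the batch:
            sample a group of rollouts from the current student pi_theta(. | x)
            score each rollout with the environment or verifier
            keep the rollouts verified as correct as demonstrations
            for each failed rollout (or every rollout):
                sample a demonstration d from the correct rollouts
                teacher: pi_theta conditioned on x, d and the rollout prefix (stop-gradient)
                student: pi_theta conditioned on x and the rollout prefix only
                accumulate the JSD or KL divergence at the designated positions
        update theta on the distillation loss, optionally added to an RL loss such as GRPO with a small weight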

This structure applies across domains, including science QA (Ma et al., 18 May 2026), competitive programming (Hübotter et al., 28 Jan 2026), continual learning (Shenfeld et al., 27 Jan 2026), and controlled reasoning tasks (Nicolicioiu et al., 24 Jun 2026).

3. Theoretical Analysis and Distinction from RL

SDSD is not equivalent to ideal on-policy RL, and theoretical results highlight this divergence. Formal analysis (Nicolicioiu et al., 24 Jun 2026) demonstrates:

SDSD optimal distribution: Given a base model and a distribution over correct demonstrations, the SDSD-optimized policy admits a closed form that depends on the pointwise conditional mutual information (PCMI).

RL preserves mode ratios: In contrast, the optimal RL policy with KL regularization preserves base-probability ratios among equally correct trajectories. SDSD amplifies pre-existing probability gaps whenever sampled demonstrations favor specific trajectories, resulting in a “rich get richer” effect.
Diversity collapse: The compounding bias towards already-likely correct rollouts arises because the same base distribution generates both student and teacher behaviours, and correct demonstrations are sampled from the policy’s current support. The theoretical and empirical implication is reduced functional and semantic diversity relative to RL methods, particularly in settings where multiple diverse rollouts are desirable.
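The contrast between ratio preservation and "rich get richer" dynamics can be illustrated with a three-trajectory toy problem. The KL-regularized RL optimum used below, $\pi^*(y) \propto \pi_{\text{base}}(y)\exp(r(y)/\beta)$, is the standard closed form. The self-distillation step, in contrast, is a deliberately simplified assumption and not the analysis from the cited work: the hypothetical teacher multiplies the probability of its own demonstration by $1 + \alpha$, and the student is set to the demonstration-averaged teacher.

```python
import numpy as np

def kl_regularized_rl_optimum(base, reward, beta):
    """Optimum of E[r] - beta * KL(pi || base): pi proportional to base * exp(r / beta)."""
    weights = base * np.exp(reward / beta)
    return weights / weights.sum()

def toy_self_distill_step(policy, correct, alpha):
    """One idealized self-distillation step under the toy teacher (an assumption)."""
    demo_probs = policy * correct
    demo_probs = demo_probs / demo_probs.sum()  # demos sampled on-policy among correct ones
    new_policy = np.zeros_like(policy)
    for d in np.flatnonzero(correct):
        teacher = policy.copy()
        teacher[d] *= 1.0 + alpha               # hypothetical: the demo boosts its own trajectory
        teacher /= teacher.sum()
        new_policy += demo_probs[d] * teacher   # student matches the demo-averaged teacher
    return new_policy

base = np.array([0.5, 0.2, 0.3])     # trajectories A and B are correct, C is incorrect
correct = np.array([1.0, 1.0, 0.0])

rl = kl_regularized_rl_optimum(base, reward=correct, beta=0.5)
sd = base.copy()
for _ in range(5):
    sd = toy_self_distill_step(sd, correct, alpha=1.0)

print("base ratio A/B:", base[0] / base[1])
print("RL ratio A/B:", rl[0] / rl[1])
print("toy self-distillation ratio A/B:", sd[0] / sd[1])
```

In this toy setting the RL optimum keeps the A/B ratio of the base model, because both correct trajectories receive the same exponential reweighting, while the self-distillation iterates widen it, because the more likely trajectory is also the more frequently sampled demonstration.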

4. Empirical Results and Performance Characteristics

SDSD has been empirically evaluated against strong RL and SFT baselines across multiple domains:

Search-Augmented QA (Ma et al., 18 May 2026): With Qwen2.5-3B and Qwen2.5-7B, SDSD (SD-Search) achieves average EM scores of 0.428 and 0.476, surpassing RL methods such as AutoRefine and MR-Search and matching process-supervision approaches that require large external teachers. Gains concentrate on multi-hop reasoning tasks.
Scientific reasoning and code generation (Hübotter et al., 28 Jan 2026): SDSD outperforms optimized GRPO by +4–6 accuracy points and achieves shorter, more efficient solution chains—requiring 4–10× fewer generations to hit comparable top-1 accuracy in benchmarks such as LiveCodeBench.
Continual learning (Shenfeld et al., 27 Jan 2026): SDSD (as SDFT) enables models to accumulate new skills without catastrophic forgetting, surpassing supervised fine-tuning in both new-task and prior-task accuracy (e.g., Science QA: 70/65 for SDSD vs. 66/53 for SFT).
Diversity and OOD generalization (Nicolicioiu et al., 24 Jun 2026): Despite strong pass@1, SDSD models exhibit flattened pass@k curves (i.e., drawing additional samples yields little additional coverage), reduced semantic diversity in rollouts, and brittle performance on distributionally shifted tasks that require diverse or atypical reasoning paths.

Ablations confirm:

Removing outcome labels or future-masking from the teacher context degrades accuracy;
Using broader distillation scope (all tokens vs. designated positions) is less effective;
Hyperparameters such as group size and top-K truncation show mild inverted-U sensitivity.

5. Implementation Protocols and Hyperparameters

SDSD introduces only modest overhead to standard training pipelines:

Backbone models: Qwen2.5-3B, Qwen2.5-7B, Olmo-3-7B, and similar autoregressive LLMs;
Demonstration group size: The number $G$ of rollouts sampled per input;
Distillation strength: Auxiliary loss scaled by a small coefficient $\lambda$;
KL/JSD approximation: Top-K truncation of the vocabulary for per-token divergences;
Optimization: AdamW with low learning rate and standard PPO clipping;
Self-teacher regularization: Either single-model teacher via stop-gradient or momentum/EMA copy updated as $\theta_{\text{T}} \leftarrow \mu\,\theta_{\text{T}} + (1 - \mu)\,\theta$;
Prompt construction: Teacher sees context plus demonstration or feedback; student conditioned only on context.

No external model inference, no additional annotation pipeline, and no reward shaping are required; the training protocol remains purely on-policy.
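Two of the protocol items above, top-K truncation of per-token divergences and the momentum/EMA teacher, lend themselves to a short NumPy sketch. The truncation scheme shown (the student's K most likely tokens plus a single tail bucket for the remaining mass) is one plausible choice and an assumption of this example; the function names, the value of `k`, and the momentum `mu` are likewise illustrative.

```python
import numpy as np

def topk_truncated_kl(student_probs, teacher_probs, k):
    """KL(student || teacher) over the student's top-k tokens plus one tail bucket."""
    top = np.argsort(student_probs)[-k:]          # indices of the k largest student probs
    s = np.append(student_probs[top], 1.0 - student_probs[top].sum())
    t = np.append(teacher_probs[top], 1.0 - teacher_probs[top].sum())
    s = np.clip(s, 1e-12, None)                   # guard against log(0) in the tail bucket
    t = np.clip(t, 1e-12, None)
    return float((s * np.log(s / t)).sum())

def ema_update(teacher_params, student_params, mu):
    """Momentum update of the teacher copy: mu * teacher + (1 - mu) * student."""
    return {name: mu * teacher_params[name] + (1.0 - mu) * student_params[name]
            for name in teacher_params}

student_probs = np.array([0.5, 0.3, 0.1, 0.1])
teacher_probs = np.array([0.25, 0.25, 0.25, 0.25])
full_kl = float((student_probs * np.log(student_probs / teacher_probs)).sum())
print("full KL:", full_kl)
print("top-2 KL:", topk_truncated_kl(student_probs, teacher_probs, k=2))

teacher_params = {"w": np.zeros(2)}
student_params = {"w": np.ones(2)}
teacher_params = ema_update(teacher_params, student_params, mu=0.9)
print(teacher_params["w"])
```

Merging the tail into one bucket keeps both truncated vectors normalized, so the truncated value never exceeds the full KL; the trade-off is that disagreements among low-probability tokens are no longer visible to the loss.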

6. Limitations and Trade-Offs

The main limitation of SDSD is output diversity collapse. Empirical and theoretical evidence (Nicolicioiu et al., 24 Jun 2026) demonstrates:

SDSD consistently boosts pass@1 accuracy but flattens pass@k curves; generating more samples fails to increase solution diversity;
In combinatorial reasoning or structured domain tasks, SDSD concentrates probability mass on dominant solution modes. When multiple correct strategies exist, rare or atypical correct modes are underrepresented or entirely eliminated;
This poses challenges for OOD generalization and applications demanding solution set diversity (e.g., planning, discovery, robust factual retrieval).
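To make the pass@1 versus pass@k distinction concrete, the sketch below uses the commonly used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are drawn per problem and $c$ of them are correct. The two sets of per-problem counts are invented purely for illustration and do not come from any of the cited experiments.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (out of n, c correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the number of correct samples per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 10
collapsed = [10, 10, 0, 0]   # hypothetical: always right or always wrong
diverse = [4, 4, 4, 4]       # hypothetical: sometimes right on every problem

for k in (1, 5):
    print(k, mean_pass_at_k(collapsed, n, k), mean_pass_at_k(diverse, n, k))
```

With these made-up counts the collapsed model has the higher pass@1 (0.5 versus 0.4) but a flat curve, since extra samples never solve the problems it always misses, while the diverse model's pass@5 rises well above its pass@1.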

Remedies discussed involve augmenting SDSD with explicit diversity regularizers (e.g., mutual information bonuses or ensemble teachers) and monitoring functional/semantic diversity metrics, not just token-level entropy. A plausible implication is that SDSD is well-suited for applications where top-1 reliability is paramount, but practitioners should exercise caution, or add such augmentations, when facing tasks that require exhaustive search or robust exploration.

7. Relationship to Other Process-Supervision and Imitation Methods

SDSD stands distinct from process-supervision and imitation learning in several critical respects:

No external teacher dependence: Unlike process-supervision frameworks relying on much stronger annotators or subquestion ground-truth, SDSD is purely self-supervised through on-policy demonstration sampling.
Dense feedback signal: In contrast to SFT and reinforcement learning with verifiable rewards (RLVR), which provide off-policy and trajectory-level supervision respectively, SDSD offers local, dense, token- or step-level gradient signals, directly bridging the credit assignment gap.
Continual learning utility: SDSD methods mitigate catastrophic forgetting, as shown in multi-stage skill/knowledge acquisition tasks, outperforming SFT and matching RL in preserving prior competencies (Shenfeld et al., 27 Jan 2026).

The methodology is now a central paradigm in fine-tuning and ongoing post-training of LLMs, but its mode-seeking tendencies and diversity impacts remain a subject of active research and careful evaluation.