Two-Stage Diversity-Exploring Distillation

Updated 14 November 2025
  • The paper introduces a two-phase pipeline that first generates a diverse spectrum of candidate solutions via supervised fine-tuning and then amplifies valid reasoning through a reinforcement learning "signal" phase.
  • It employs MaxEnt-Guided Policy Optimization (MGPO) to target queries near maximum uncertainty, achieving a 3–4 point Pass@1 gain on benchmarks.
  • By decoupling solution generation from signal amplification, the approach enables small language models to exhibit reasoning capabilities akin to much larger systems.

Two-Stage Diversity-Exploring Distillation is a supervised and reinforcement-learning pipeline designed to endow relatively small LLMs (e.g., VibeThinker-1.5B, 1.5B parameters) with reasoning and problem-solving capabilities that rival much larger-scale systems. This approach, central to the Spectrum-to-Signal Principle (SSP), is characterized by an initial supervised fine-tuning (SFT) "spectrum phase," in which a broad diversity of solutions is generated, followed by a reinforcement learning (RL) "signal phase," where MaxEnt-Guided Policy Optimization (MGPO) amplifies the correct and logically consistent responses discovered earlier. This two-stage process, departing from pure scale-driven regimes, emphasizes combinatorial coverage (“diversity-exploring”) and curriculum-guided sharpening of reasoning ability.

1. Conceptual Overview and Motivation

Two-Stage Diversity-Exploring Distillation is motivated by the observation that small models, trained conventionally, lag substantially in Pass@1 reasoning accuracy on math, logic, and code domains compared to their larger counterparts. The root challenge is attributed not solely to parameter count but to a lack of effective search and selection in SFT and RL. The method addresses this by decoupling the identification of valid reasoning traces from their amplification, focusing the RL signal via entropy-driven adaptive weighting rather than global reward averages. This process is implemented as part of the Spectrum-to-Signal Principle, which challenges the prevailing paradigm of pure model scaling and instead insists that sufficient diversity—followed by rigorous distillation—can elicit complex abilities in dense models several orders of magnitude smaller than state-of-the-art gigamodels (Xu et al., 9 Nov 2025).

2. Stage One: Spectrum Generation through Diversity-Driven SFT

The first stage centers on creating a "spectrum" of potential solution paths for each task or query, typically via large-batch SFT with explicit diversity encouragement:

  • For each query $q$, the base model or early-stage SFT model samples $G$ independent completions $\{y_i\}_{i=1}^{G}$.
  • Solutions are not prematurely filtered or collapsed; instead, the full set—including rare but correct and logically challenging traces—is preserved.
  • The diversity of sampled completions (high Pass@K with $K > 1$) is critical, ensuring coverage of multiple plausible solution modes.

This spectrum, or ensemble of candidate outputs, is explicitly retained to fuel more selective downstream optimization in the RL phase.
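A minimal sketch of this stage is given below, assuming hypothetical `sample_fn` and `verify_fn` interfaces for the model's sampler and binary correctness checker (the paper does not prescribe a specific API); the point is simply that all $G$ completions and their correctness labels are retained.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SpectrumItem:
    query: str
    completions: List[str]  # all G sampled traces, none filtered out
    rewards: List[int]      # binary correctness per trace (1 = correct)

def build_spectrum(
    queries: List[str],
    sample_fn: Callable[[str, int], List[str]],  # hypothetical: (query, G) -> G completions
    verify_fn: Callable[[str, str], int],        # hypothetical: (query, completion) -> {0, 1}
    G: int = 16,                                 # rollouts per query (paper reports G = 8-32)
) -> List[SpectrumItem]:
    """Stage one: sample G diverse completions per query and keep them all."""
    spectrum = []
    for q in queries:
        ys = sample_fn(q, G)                # high-temperature sampling encourages diverse modes
        rs = [verify_fn(q, y) for y in ys]  # correctness indicators are kept for the RL phase
        spectrum.append(SpectrumItem(query=q, completions=ys, rewards=rs))
    return spectrum
```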

3. Stage Two: MaxEnt-Guided Policy Optimization (MGPO) for Signal Amplification

MGPO implements the "signal" phase by focusing reinforcement learning updates on queries where the current policy exhibits maximum epistemic uncertainty (i.e., a correct-answer rate $p_c(q) \approx 0.5$), using an information-theoretically principled objective:

  • For each query $q$, define the empirical correct-answer probability:

$$p_c(q) = \frac{1}{G} \sum_{i=1}^{G} \mathbb{1}[r_i = 1]$$

where $r_i \in \{0, 1\}$ denotes binary correctness.

  • The maximum-entropy (ME) deviation is computed as KL-divergence from uniform:

$$D_{ME}\big(p_c(q) \,\|\, 0.5\big) = p_c(q)\log\frac{p_c(q)}{0.5} + \big(1 - p_c(q)\big)\log\frac{1 - p_c(q)}{0.5}$$

  • Weight for each task is determined by

$$w_{ME}\big(p_c(q)\big) = \exp\!\big(-\lambda\, D_{ME}(p_c(q) \,\|\, 0.5)\big)$$

with $\lambda$ controlling sharpness.

  • The classic group-relative advantage for RL is reweighted:

$$\mathcal{A}'_{i,t}(q) = w_{ME}\big(p_c(q)\big) \cdot \mathcal{A}_{i,t}(q)$$

  • MGPO’s policy-gradient surrogate objective aggregates these weighted advantages, with PPO-style ratio clipping to stabilize updates.

Thus, sample efficiency and optimization focus are both increased by concentrating RL’s effect on examples where the policy is neither succeeding with certainty nor failing universally—that is, at the “learning frontier.”
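The NumPy sketch below computes $p_c$, $D_{ME}$, and $w_{ME}$ as defined above and applies the weight to group-relative advantages; the value $\lambda = 2$ and the clamping of degenerate $p_c$ values are illustrative choices rather than settings reported in the paper.

```python
import numpy as np

def maxent_weight(rewards: np.ndarray, lam: float = 2.0, eps: float = 1e-8) -> float:
    """MGPO weight for one query, from its G binary rollout rewards in {0, 1}."""
    p = float(rewards.mean())               # empirical correct-answer probability p_c(q)
    p = min(max(p, eps), 1.0 - eps)         # clamp to avoid log(0) when p_c is exactly 0 or 1
    d_me = p * np.log(p / 0.5) + (1 - p) * np.log((1 - p) / 0.5)  # KL-divergence from uniform
    return float(np.exp(-lam * d_me))       # maximal (= 1) at p_c = 0.5, decays toward 0 or 1

def weighted_advantages(rewards: np.ndarray, lam: float = 2.0) -> np.ndarray:
    """Group-relative advantages reweighted by the max-entropy factor."""
    mu, sigma = rewards.mean(), rewards.std() + 1e-8  # group statistics over the G rollouts
    adv = (rewards - mu) / sigma                      # standard group-relative advantage A_i
    return maxent_weight(rewards, lam) * adv          # A'_i = w_ME(p_c) * A_i

# Uncertain queries keep full weight; nearly solved ones are down-weighted.
print(maxent_weight(np.array([1, 0, 1, 0])))              # p_c = 0.5   -> 1.0
print(maxent_weight(np.array([1, 1, 1, 1, 1, 1, 1, 0])))  # p_c = 0.875 -> ~0.53
```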

4. Implementation Details and Algorithmic Steps

The practical workflow for Two-Stage Diversity-Exploring Distillation is as follows:

  1. Supervised Spectrum Generation:
    • Sample $G = 8$–$32$ completions per query.
    • Evaluate each with a reward function; retain all completions and correctness indicators.
  2. MGPO Signal Extraction:
    • For each RL update batch:
      • Sample a set of queries $Q$ from the training set.
      • For each $q \in Q$:
        • Compute $p_c(q)$, $D_{ME}$, $w_{ME}$, and the mean $\mu$ and standard deviation $\sigma$ of the rewards.
        • For all sampled rollouts, compute policy ratios $\rho$, the group-relative advantage, and the weighted advantage.
        • Compute the policy gradient with PPO-style clipping, step the parameters, and optionally update the reference policy ($\theta_{\text{old}}$).
    • Hyperparameters include $\lambda \in [1, 10]$ (entropy sharpness), $\epsilon \approx 0.1$–$0.2$ (clip range), batch size $B = 32$–$128$, and learning rates from $1 \times 10^{-6}$ to $5 \times 10^{-5}$.
    • Hardware setup uses mixed-precision (fp16) training on NVIDIA H800 GPUs, large-batch vLLM sampling, and gradient norm clipping.
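A compact PyTorch-style sketch of the resulting surrogate objective follows; the tensor layout, default clip value, and commented optimizer step are illustrative assumptions rather than the authors' implementation.

```python
import torch

def mgpo_surrogate_loss(
    logp_new: torch.Tensor,      # (G, T) token log-probs under the current policy
    logp_old: torch.Tensor,      # (G, T) token log-probs under the rollout (old) policy
    weighted_adv: torch.Tensor,  # (G,) per-rollout advantages already scaled by w_ME(p_c)
    mask: torch.Tensor,          # (G, T) 1 for generated tokens, 0 for padding
    eps_clip: float = 0.2,       # PPO clip range (paper reports 0.1-0.2)
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                 # per-token importance ratio rho
    adv = weighted_adv.unsqueeze(1)                        # broadcast A'_i over the tokens of rollout i
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv
    per_token = torch.minimum(unclipped, clipped) * mask   # PPO pessimistic bound, masked to real tokens
    return -per_token.sum() / mask.sum()                   # maximize the surrogate = minimize its negative

# One update step, assuming `policy`, `optimizer`, and rollout tensors already exist:
# loss = mgpo_surrogate_loss(logp_new, logp_old, weighted_adv, mask)
# optimizer.zero_grad(); loss.backward()
# torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)  # gradient norm clipping, as noted above
# optimizer.step()
```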

A summary of key steps is outlined below.

| Stage | Core Operation | Output/Effect |
| --- | --- | --- |
| Spectrum (SFT) | Sample $G$ completions per query $q$ | Broad solution diversity |
| Signal (MGPO, RL) | Weighted policy gradient on $p_c \approx 0.5$ queries | Amplifies correct traces |

5. Theoretical and Empirical Rationale

The methodology is rooted in maximum-entropy learning, where focusing on tasks with $p_c \approx 0.5$ ensures that the update signal is maximally informative. By exponentiating the negative KL-divergence from equiprobability, MGPO imposes a smooth curriculum, encouraging the model to target tasks at the edge of its capability. Implicitly, this avoids reinforcing confidently correct (or incorrect) responses, which supply vanishingly small empirical gradients, and instead targets the frontier of potential improvement. Empirical ablation on the AIME25 benchmark demonstrates a 3–4 point Pass@1 gain over standard Group Relative Policy Optimization (GRPO), with final MGPO models scoring 74.4 (vs. 70.0 baseline). Tuning $\lambda$ and the rollout count $G$ offers predictable, monotonic trade-offs in performance and stability (Xu et al., 9 Nov 2025).
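For a concrete sense of the curriculum, take a hypothetical $\lambda = 2$ (not a value reported in the paper): at $p_c = 0.5$, $D_{ME} = 0$ and $w_{ME} = 1$; at $p_c = 0.9$, $D_{ME} \approx 0.37$ and $w_{ME} \approx 0.48$; and at $p_c = 1$ (or $0$), $D_{ME} = \ln 2$ and $w_{ME} = 0.25$, so queries the policy already solves (or always fails) contribute only a quarter of the gradient weight of maximally uncertain ones.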

6. Comparative Performance and Impact on Model Scaling

Deployment of Two-Stage Diversity-Exploring Distillation in VibeThinker-1.5B achieves strong performance on math (AIME24, AIME25, HMMT25) and code (LiveCodeBench V6) benchmarks, meeting or exceeding larger closed-source models (Magistral Medium, Claude Opus 4) and open-source models (GPT OSS-20B Medium), and outperforming the roughly 400× larger DeepSeek R1 on select metrics, with substantially reduced training and inference budgets ($7,800 total cost). A plausible implication is that correct reasoning "traces," once discovered and amplified by MGPO, can compensate for parameter count by increasing the effective complexity and logical depth of output distributions. This challenges the established scaling laws in favor of algorithmic and curriculum-based improvements.

7. Practical Considerations and Limitations

The approach is reliant on high-quality, diverse spectrum generation in the SFT phase; insufficient diversity limits the signal available to amplify in the RL phase. MGPO requires adequate rollout counts per task to stabilize the estimation of $p_c(q)$, and tuning $\lambda$ is required to avoid both over-focusing (forgetting easy tasks) and under-focusing (diluting the curriculum). While the underlying objectives are general, realized gains are currently concentrated in domains with well-defined correctness criteria (math, code, logic). Extension to open-ended reasoning and subjective tasks remains to be established.


Two-Stage Diversity-Exploring Distillation, as operationalized in the Spectrum-to-Signal Principle, provides an information-theoretically grounded and empirically validated pipeline for transferring large-model reasoning ability to dense models of moderate size, primarily through explicit diversification and entropy-driven sharpening of policy optimization (Xu et al., 9 Nov 2025).
