Two-Stage Diversity-Exploring Distillation
- The paper introduces a two-stage approach that first explores a broad spectrum of model outputs and then fuses these diverse modes to maintain accuracy and robustness.
- The methodology leverages diversity metrics like Pass@K and DreamSim to guide optimal checkpoint selection and model parameter averaging across subdomains.
- The approach outperforms standard single-objective distillation in diversity and calibration while matching distilled-model inference speed, across language-model, diffusion-model, and ensemble-compression tasks.
Two-Stage Diversity-Exploring Distillation encompasses a family of optimization strategies designed to explicitly promote sample- or function-space diversity during distillation or supervised fine-tuning (SFT), usually in neural-network-based generative modeling and supervised learning contexts. Unlike standard single-objective distillation or supervised fine-tuning, which tends to collapse to a single dominant output or narrow hypothesis mode, two-stage diversity-exploring distillation deliberately injects diversity-seeking objectives and selection criteria at intermediate points, followed by a signal-amplification or merging phase. This architecture is particularly prominent in recent diffusion model acceleration (Gandikota et al., 13 Mar 2025), compact LLM fine-tuning (Xu et al., 9 Nov 2025), and ensemble-to-single-model compression (Nam et al., 2021).
1. Conceptual Motivation and Theoretical Rationale
Standard distillation and fine-tuning workflows in neural sequence models, diffusion models, and deep classifiers prioritize accuracy, likelihood, or single-point performance metrics (e.g., Pass@1, clean dataset likelihood). This tends to cause mode collapse, where the learned distribution omits many plausible solutions, reducing sample diversity and limiting robustness, reasoning coverage, or creativity. Two-stage diversity-exploring distillation directly addresses this deficit by separating:
- Diversity Construction: The first stage seeks a broad "spectrum" of function modes, output behaviors, or reasoning chains, using explicit diversity metrics (e.g., ensemble disagreement, Pass@K, DreamSim distance, high-variance output statistics).
- Signal Amplification or Fusion: The second stage consolidates the discovered diversity via either expert model fusion, hybrid-sampler switching, or loss-weighted merging, ensuring that the final model maintains both accuracy and high diversity.
The Spectrum-to-Signal Principle (SSP) formalizes the advantage of this decoupling: by maximizing exploration first (often under constraints such as subdomain specialization or targeted perturbation), subsequent optimization (RL, averaging, distillation) operates over a richer hypothesis space, consistently yielding higher single-shot and multi-sample performance (Xu et al., 9 Nov 2025).
2. Methodological Frameworks and Algorithms
2.1 Domain-Aware Diversity Probing and Model Fusion
In language or reasoning models, stage one involves partitioning the domain into $N$ subdomains $S_1, \dots, S_N$ (e.g., algebra, geometry, code, knowledge; $N=4$ in (Xu et al., 9 Nov 2025)). For each subdomain $S_i$, a probing set $D_i$ is constructed. During fine-tuning, periodic checkpoints $M_t$ are evaluated by Pass@K, the fraction of probing questions for which at least one of $K$ generated outputs is exactly correct:

$$\mathrm{Pass@K}(M_t; D_i) = \frac{1}{|D_i|} \sum_{q \in D_i} \mathbb{1}\!\left[\exists\, k \in \{1,\dots,K\}: c\big(y^{(q)}_k\big) = 1\right],$$

where $c(\cdot)$ is an exact solution checker and $y^{(q)}_1,\dots,y^{(q)}_K$ are samples drawn from $M_t$ for question $q$. The optimal checkpoint for subdomain $S_i$ is selected as

$$t_i^{*} = \arg\max_{t}\ \mathrm{Pass@K}(M_t; D_i).$$
The specialists $\{M_{t_i^{*}}\}_{i=1}^{N}$ are then merged into one composite SFT model by parameter averaging:

$$\theta_{\mathrm{merged}} = \sum_{i=1}^{N} w_i\,\theta_i^{*}, \qquad \sum_{i=1}^{N} w_i = 1,$$

with uniform weights $w_i = 1/N$ as the standard choice. The resulting model preserves a union of diverse solution modes.
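The following is a minimal PyTorch-style sketch of the checkpoint-selection and fusion steps described above; `generate_k` (a sampling routine) and `is_correct` (an exact-answer checker) are hypothetical placeholders rather than APIs from (Xu et al., 9 Nov 2025).

```python
import torch

def pass_at_k(model, probe_set, k, generate_k, is_correct):
    """Fraction of probe questions with at least one exactly correct answer among k samples."""
    solved = 0
    for question, reference in probe_set:
        samples = generate_k(model, question, k)          # draw k candidate solutions
        solved += any(is_correct(s, reference) for s in samples)
    return solved / len(probe_set)

def select_best_checkpoint(checkpoints, probe_set, k, generate_k, is_correct):
    """Stage one: keep the checkpoint that maximizes Pass@K on the subdomain probing set."""
    return max(checkpoints, key=lambda m: pass_at_k(m, probe_set, k, generate_k, is_correct))

def merge_specialists(specialists, weights=None):
    """Stage two: merge same-architecture specialists by (uniform) parameter averaging."""
    weights = weights or [1.0 / len(specialists)] * len(specialists)
    base = specialists[0].state_dict()
    merged = {}
    for name, param in base.items():
        if param.is_floating_point():
            merged[name] = sum(w * m.state_dict()[name] for w, m in zip(weights, specialists))
        else:
            merged[name] = param.clone()   # copy integer buffers (e.g., step counters) unchanged
    return merged                          # load with model.load_state_dict(merged)
```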
2.2 Hybrid Inference in Diffusion Models
For distilled diffusion models, stage one consists of running the base (slow, high-diversity) sampler for the first $\tau$ (typically $\tau = 1$) denoising steps of the reverse process, with the remainder handled by the fast, (otherwise) diversity-collapsed distilled model (Gandikota et al., 13 Mar 2025):

$$x_{t-1} = \begin{cases} f_{\mathrm{base}}(x_t, t), & t > T - \tau,\\ f_{\mathrm{distilled}}(x_t, t), & t \le T - \tau. \end{cases}$$

The selection of $\tau$ is critical: DT-visualization establishes that the earliest denoising step disproportionately shapes structural, sample-level diversity, justifying $\tau = 1$ as optimal in practice.
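A minimal sketch of the hybrid reverse process, assuming `f_base` and `f_distilled` are single-step denoisers mapping $(x_t, t)$ to $x_{t-1}$ and sharing the same latent space and noise schedule (both are assumptions of this sketch, not implementation details from (Gandikota et al., 13 Mar 2025)):

```python
def hybrid_sample(f_base, f_distilled, x_T, T, tau=1):
    """Hybrid inference: base denoiser for the first tau (high-noise) steps, distilled for the rest."""
    x = x_T
    for t in range(T, 0, -1):
        if t > T - tau:           # earliest steps: base model restores structural, sample-level diversity
            x = f_base(x, t)
        else:                     # remaining steps: distilled model provides the speed-up
            x = f_distilled(x, t)
    return x
```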
2.3 Diversity-Seeking Perturbation in Ensemble Distillation
In deep ensemble distillation, the first stage is conventional one-to-one distillation (matching student logits to teachers on clean inputs). Stage two introduces diversity-revealing perturbations:
- For input $x$, select a random teacher $f_j$ from the ensemble and a random guide vector $w \sim \mathcal{U}(-1,1)^{C}$ over the $C$ output classes.
- Compute the ODS direction $g = \nabla_x\big(w^{\top} f_j(x)\big)$ and normalize it to $\hat{g} = g / \lVert g \rVert_2$.
- Perturb the input: $x' = x + \eta\,\hat{g}$ (step-size $\eta$).
- Optionally scale the perturbation by the teacher's confidence on $x$ (the ConfODS variant).
Student training then combines clean and perturbed KL matching for all ensemble members (Nam et al., 2021).
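A minimal PyTorch sketch of the ODS/ConfODS perturbation step above; the particular confidence-scaling rule in the ConfODS branch is an assumption of this sketch rather than the exact formulation of (Nam et al., 2021).

```python
import torch

def ods_perturb(x, teachers, eta=1.0 / 255, conf_scale=False):
    """Output-diversified (ODS) input perturbation used in stage two of ensemble distillation."""
    teacher = teachers[torch.randint(len(teachers), (1,)).item()]  # random teacher per batch
    x = x.clone().requires_grad_(True)
    logits = teacher(x)                                            # (B, C) class logits
    w = torch.empty_like(logits).uniform_(-1.0, 1.0)               # random guide vector in [-1, 1]^C
    (w * logits).sum().backward()                                  # gradient of w^T f_j(x) w.r.t. x
    g = x.grad
    g_hat = g / (g.flatten(1).norm(dim=1).view(-1, *[1] * (g.dim() - 1)) + 1e-12)
    if conf_scale:                                                 # ConfODS: scale by teacher confidence (assumed rule)
        conf = logits.softmax(dim=1).max(dim=1).values.view(-1, *[1] * (g.dim() - 1))
        g_hat = conf * g_hat
    return (x + eta * g_hat).detach()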
3. Formal Algorithmic Structure
A unified pseudocode summary for the Two-Stage Diversity-Exploring Distillation paradigm is as follows:
```
# Stage 1 + fusion (language / reasoning models)
for each subdomain i in 1..N:
    for training step t in 1..T:
        perform supervised fine-tuning, periodically saving checkpoint M_t
        compute Pass@K(M_t; D_i) on probing set D_i
    select best checkpoint t_i* maximizing Pass@K for S_i
merge models {M_i*} by parameter averaging:
    theta_merged = sum_i (w_i * theta_i*)

# Stage 2 as hybrid inference (diffusion models)
initialize x_T as standard Gaussian noise
for t = T down to 1:
    if t > T - tau:
        x_{t-1} = f_base(x_t, t)
    else:
        x_{t-1} = f_distilled(x_t, t)
return x_0
```
For ensemble distillation:
- On each minibatch, compute loss as sum of cross-entropy (clean input) and KL divergence (perturbed input).
- ODS perturbation and ConfODS as described above are integral to stage two.
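A sketch of a per-minibatch stage-two objective consistent with the description above, reusing the `ods_perturb` sketch from Section 2.3; the α-weighted convex combination and the temperature handling are assumptions of this sketch, not the exact loss of (Nam et al., 2021).

```python
import torch.nn.functional as F

def distill_step(student, teachers, x, y, eta=1.0 / 255, alpha=0.9, temperature=1.0):
    """One stage-two minibatch: hard-label CE on clean inputs plus KL to each teacher
    on ODS-perturbed inputs, mixed with weight alpha (assumed convex combination)."""
    x_perturbed = ods_perturb(x, teachers, eta=eta)        # diversity-revealing inputs

    ce_clean = F.cross_entropy(student(x), y)

    student_logp = F.log_softmax(student(x_perturbed) / temperature, dim=1)
    kl_perturbed = sum(
        F.kl_div(student_logp,
                 F.softmax(t(x_perturbed) / temperature, dim=1),
                 reduction="batchmean")
        for t in teachers
    ) / len(teachers)

    return (1 - alpha) * ce_clean + alpha * kl_perturbed
```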
4. Diversity Metrics, Hyperparameters, and Empirical Validation
Diversity Quantification
The central operational metric for sequence models and reasoning tasks is Pass@K, as defined in Section 2.1:

$$\mathrm{Pass@K}(M; D) = \frac{1}{|D|} \sum_{q \in D} \mathbb{1}\!\left[\exists\, k \le K: c\big(y^{(q)}_k\big) = 1\right].$$

High Pass@K indicates many diverse, correct outputs. In diffusion models, DreamSim distance (average pairwise feature distance between generated samples) and FID measure sample-level diversity and realism, respectively.
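For the diffusion-side metric, a short sketch of an average-pairwise-distance diversity score in the DreamSim style; `embed` is a hypothetical perceptual feature extractor standing in for the actual DreamSim model.

```python
import itertools
import torch

def mean_pairwise_distance(samples, embed):
    """Average pairwise feature distance over a set of generated samples.
    Higher values indicate greater sample-level diversity."""
    feats = [embed(s) for s in samples]                    # one feature vector per sample
    pairs = itertools.combinations(range(len(feats)), 2)
    dists = [torch.dist(feats[i], feats[j]) for i, j in pairs]
    return torch.stack(dists).mean()
```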
Experimentally Validated Outcomes
- In mathematics and code, parameter-fused Merge-SFT models achieve state-of-the-art Pass@K and also improve Pass@1 relative to standard SFT, despite greater diversity (Xu et al., 9 Nov 2025).
- In diffusion, the hybrid (τ=1) method achieves FID below both the base and distilled models: FID 10.79 (hybrid) vs 12.74 (base) and 15.52 (distilled), with speed identical to distilled models (0.64s/image vs 9.22s for base, COCO-30k). DreamSim and CLIP metrics also favor the hybrid.
- For deep ensemble distillation, combining ODS-based perturbation in training nearly closes the test accuracy and calibration gap to the full teacher ensemble. On CIFAR-10, BatchEns-4 student with ConfODS achieves ACC=94.01 compared to DeepEns-4's ACC=94.42, with diversity metrics significantly improved over vanilla KD (Nam et al., 2021).
Hyperparameter Regimes (for key settings):
| Setting | Value(s) / Notes | Context |
|---|---|---|
| Subdomains N | 4 (algebra, geometry, calculus, stat) | (Xu et al., 9 Nov 2025) |
| Pass@K probe K | 64 (math), 8 (code), 16 (knowledge) | (Xu et al., 9 Nov 2025) |
| τ (hybrid switching step) | τ=1 (optimal by DT-visualization) | (Gandikota et al., 13 Mar 2025) |
| ODS step-size η | η=1/255 (fixed) | (Nam et al., 2021) |
| Distillation loss α | α=0.9 | (Nam et al., 2021) |
5. Comparative Analysis with Standard Approaches
The two-stage paradigm stands in contrast to:
- Standard SFT: Optimizes only a single-point loss (e.g., cross-entropy, with checkpoint selection by Pass@1) and tends to collapse to the dominant or most frequent solution paths; there is no explicit subdomain probing, model merging, or diversity-metric integration.
- Vanilla Ensemble Distillation: Without diversity-revealing input generation (beyond, e.g., Gaussian noise), students absorb the average ensemble function but fail to inherit the teacher ensemble's diversity. The stage-two diversity-seeking perturbation mechanism (ODS, ConfODS) is critical for closing this gap (Nam et al., 2021).
- One-shot Model Compression: Does not leverage staged diversity construction, checkpoint selection, parameter merging, or domain decomposition.
Empirically, two-stage diversity-exploring distillation consistently outperforms baseline approaches in diversity metrics (Pass@K, pairwise KL, DreamSim), sample-level coverage, and, in many cases, single-point accuracy.
6. Practical Implementation and Applications
The method is directly applicable to:
- LLMs (VibeThinker-1.5B): Compact models with expert fusion after domain-wise spectrum collection, yielding reasoning capabilities comparable to much larger models (Xu et al., 9 Nov 2025).
- Diffusion models: Hybrid inference pipelines that toggle between base and distilled models at early denoising steps to restore and surpass diversity, without retraining or architecture modification (Gandikota et al., 13 Mar 2025).
- Deep classifier ensembles: Training student BatchEnsemble models with diversity-seeking perturbation to inherit calibration, uncertainty, and accuracy properties of teacher ensembles almost exactly (Nam et al., 2021).
Implementation requires only checkpoint management, Pass@K or diversity-driven selection criteria, parameter averaging or loss augmentation techniques, and careful hyperparameter tuning as reported.
7. Limitations, Open Directions, and Generalization
While two-stage diversity-exploring distillation consistently improves diversity and accuracy trade-offs in reported contexts, certain challenges and ambiguities remain:
- Theoretical guarantees: Results are primarily empirical; while SSP is motivating, there is no formal proof of strict optimality or generalization superiority.
- Fusion mechanisms: Parameter averaging works well with identical architectures (e.g., per-subdomain specialists in LLMs, U-Nets in diffusion), but transfer or fusion across mismatched models is left unaddressed.
- Diversity vs. calibration: In some classifier settings, increased diversity may trade off against overconfidence or calibration; however, ODS mechanisms (particularly ConfODS) appear to mitigate such effects (Nam et al., 2021).
- Hyperparameter sensitivity: Performance hinges on proper setting of diversity metrics (choice of K, weighting schemes), switching thresholds (τ), and architecture alignment.
A plausible implication is that the principles underlying two-stage diversity-exploring distillation will generalize to additional domains—such as RL policy distillation, retrieval-augmented models, and beyond—as long as diversity of solution modes is a meaningful desideratum. Future directions may include formalizing the spectrum-to-signal framework, developing automated domain decomposition for specialist selection, and exploring fusion mechanisms in non-identical architecture ensembles.