One-step Distillation (FGM) in Generative Models

Updated 5 May 2026

One-step distillation is a technique that compresses iterative diffusion sampling into a single-step mapping, significantly accelerating generative processes.
It leverages score-matching and f-divergence based objectives to align the student network with the underlying data distribution, ensuring high sample quality.
This method is applied in advanced tasks like image synthesis, text-to-image generation, and video super-resolution, achieving performance comparable to multi-step approaches.

One-step distillation, referred to here as "first-generation mapping" (FGM, an editor's term), encompasses a family of techniques that compress the iterative sampling of diffusion and flow-based generative models into a single, non-iterative network evaluation. These methods aim to preserve or even surpass the original model's generative quality (as measured by FID and related metrics) while accelerating sampling by orders of magnitude. The FGM paradigm has become central to advances in diffusion-based image synthesis, video super-resolution, text-to-image and conditional generation, and robotic visuomotor policies. The following sections cover the mathematical principles, loss constructions, theoretical unification, architectural choices, and empirical results for state-of-the-art FGM/one-step distillation methods, with a specific focus on techniques grounded in score and f-divergence matching (Zhou et al., 2024, Wang et al., 27 May 2025, Rakitin et al., 2024, Song et al., 2024).

1. Mathematical Foundations: From Iterative Diffusion to Single-Step Mapping

The canonical diffusion model parameterizes generative sampling as the (reverse) solution of an SDE or ODE of the form

$d\mathbf x_t = f_\theta(\mathbf x_t, t)\,dt + g(t)\,d\mathbf w_t \qquad\text{with final condition}\quad \mathbf x_T \sim \mathcal N(0, I),$

which requires $\mathcal O(10^{2}\text{–}10^{3})$ iterative denoising steps. FGM/one-step approaches aim to learn a direct mapping $G_\theta: \mathbf z \mapsto \mathbf x$, where $\mathbf z \sim \mathcal N(0,I)$, that approximates the full data distribution with a single network function evaluation ($\mathrm{NFE}=1$).
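The contrast between iterative sampling and a single-step map can be illustrated on a toy problem. The sketch below is not taken from any of the cited papers: it assumes 1-D Gaussian data $\mathcal N(\mu, s^2)$ under variance-exploding noising, for which the score is known exactly and the probability-flow ODE has a closed-form solution. That closed form stands in for a learned student $G_\theta$; the function names and parameters are hypothetical.

```python
import math

def euler_probability_flow(x_T, mu, s, sigma_max, n_steps):
    """Iterative sampler: Euler-integrate dx/dsigma = -sigma * score from sigma_max to 0."""
    h = sigma_max / n_steps
    x = x_T
    for i in range(n_steps):
        sigma = sigma_max - i * h
        # Exact score of the noised marginal N(mu, s^2 + sigma^2) (toy assumption).
        score = -(x - mu) / (s * s + sigma * sigma)
        dx_dsigma = -sigma * score
        x -= h * dx_dsigma  # step toward lower noise; one "function evaluation" per step
    return x

def one_step_map(x_T, mu, s, sigma_max):
    """Closed-form solution of the same ODE: a single evaluation, no iteration."""
    return mu + (x_T - mu) * s / math.sqrt(s * s + sigma_max * sigma_max)

if __name__ == "__main__":
    mu, s, sigma_max = 2.0, 1.0, 10.0
    x_T = sigma_max * 1.0  # x_T = sigma_max * z with z = 1.0
    print("one step      :", one_step_map(x_T, mu, s, sigma_max))
    for n in (10, 100, 1000):
        print(f"Euler, NFE={n:4d}:", euler_probability_flow(x_T, mu, s, sigma_max, n))
```

In this toy setting the Euler result approaches the one-step value as the step count grows, which is the behaviour a distilled student is trained to reproduce without the loop.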

The distillation task is formalized as either (i) matching the pushforward distribution $p_\theta(\mathbf x)$ of $G_\theta$ to the data $p(\mathbf x)$ (direct divergence minimization), or (ii) matching conditional and marginal statistics in noise-perturbed space, typically via score-matching or surrogate f-divergence-based objectives integrated along the forward-diffusion path (Zhou et al., 2024, Rakitin et al., 2024, Wang et al., 27 May 2025).

2. Score-Matching, Fisher Divergence, and Semi-Implicit Marginals

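Central to many FGM approaches is matching the score ($\nabla_{\mathbf x}\log p$) of the fake (student) distribution at intermediate noisy states $\mathbf x_t$ to that of the (teacher) diffusion model:

The marginal of the data under Gaussian noising is a semi-implicit distribution: $p(\mathbf x_t) = \int q(\mathbf x_t \mid \mathbf x_0)\, p(\mathbf x_0)\, d\mathbf x_0$, with $q(\mathbf x_t \mid \mathbf x_0) = \mathcal N(a_t \mathbf x_0, \sigma_t^2 I)$ Gaussian, where $a_t$ and $\sigma_t$ are the signal and noise scales at time $t$.
By Tweedie's formula, the denoised data mean and the score are linked: $\mathbb E[\mathbf x_0 \mid \mathbf x_t] = \big(\mathbf x_t + \sigma_t^2 \nabla_{\mathbf x_t}\log p(\mathbf x_t)\big)/a_t$. This relationship is leveraged in both the real (teacher) and fake (student) distributions (Zhou et al., 2024).
The key loss is a model-based Fisher divergence:

$\mathcal L(\theta) = \mathbb E_{t,\ \mathbf x_t \sim p_\theta(\mathbf x_t)}\Big[\big\| s_\phi(\mathbf x_t, t) - \nabla_{\mathbf x_t}\log p_\theta(\mathbf x_t) \big\|_2^2\Big],$

with the teacher score $s_\phi(\mathbf x_t, t) \approx \nabla_{\mathbf x_t}\log p(\mathbf x_t)$ provided by a pretrained teacher (e.g., EDM denoiser). Since $\nabla_{\mathbf x_t}\log p_\theta(\mathbf x_t)$ is not directly accessible for the implicit generator $G_\theta$, various approximations and projection techniques are devised, including auxiliary score networks and linear combinations of direct and projected error terms (Zhou et al., 2024).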
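Tweedie's formula and the Fisher divergence can both be checked by hand when everything is Gaussian. The following is a minimal sketch under that assumption, not the SiD implementation: data and student are 1-D Gaussians, the noising is $\mathbf x_t = a_t \mathbf x_0 + \sigma_t \epsilon$, and all function names are hypothetical.

```python
import math
import random

def gaussian_score(x, mean, var):
    """Score d/dx log N(x; mean, var)."""
    return -(x - mean) / var

def tweedie_denoise(x_t, score, a_t, sigma_t):
    """Tweedie: E[x_0 | x_t] = (x_t + sigma_t^2 * score(x_t)) / a_t."""
    return (x_t + sigma_t ** 2 * score) / a_t

def posterior_mean(x_t, mu, var0, a_t, sigma_t):
    """Closed-form E[x_0 | x_t] for x_0 ~ N(mu, var0), used as a reference."""
    var_t = a_t ** 2 * var0 + sigma_t ** 2
    return mu + a_t * var0 * (x_t - a_t * mu) / var_t

def noised(mean0, var0, a_t, sigma_t):
    """Mean and variance of the noised marginal of N(mean0, var0)."""
    return a_t * mean0, a_t ** 2 * var0 + sigma_t ** 2

def fisher_exact(student, teacher, a_t, sigma_t):
    """E_{x_t ~ student}[(teacher score - student score)^2] in closed form."""
    m_s, v_s = noised(*student, a_t, sigma_t)
    m_t, v_t = noised(*teacher, a_t, sigma_t)
    return v_s * (1.0 / v_s - 1.0 / v_t) ** 2 + (m_t - m_s) ** 2 / v_t ** 2

def fisher_monte_carlo(student, teacher, a_t, sigma_t, n, rng):
    """Same quantity estimated from samples drawn from the noised student."""
    m_s, v_s = noised(*student, a_t, sigma_t)
    m_t, v_t = noised(*teacher, a_t, sigma_t)
    total = 0.0
    for _ in range(n):
        x_t = rng.gauss(m_s, math.sqrt(v_s))
        diff = gaussian_score(x_t, m_t, v_t) - gaussian_score(x_t, m_s, v_s)
        total += diff * diff
    return total / n

if __name__ == "__main__":
    rng = random.Random(0)
    student, teacher = (0.0, 1.0), (1.0, 4.0)  # (mean, variance) of clean samples
    print(fisher_exact(student, teacher, 1.0, 0.5))
    print(fisher_monte_carlo(student, teacher, 1.0, 0.5, 20000, rng))
```

The loss vanishes only when the noised student and teacher marginals coincide, which is why the divergence can serve as a distillation objective once the student score is approximated.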

3. f-Divergence Expansion, Unified Theory, and Loss Construction

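A major theoretical advance is the unification of FGM objectives via diffusion expansion of f-divergences. Given convex $f$, the static divergence $D_f(p \,\|\, p_\theta) = \int p_\theta(\mathbf x)\, f\big(p(\mathbf x)/p_\theta(\mathbf x)\big)\, d\mathbf x$ can be unfolded along the forward SDE trajectory:

$D_f(p \,\|\, p_\theta) = \int_0^T \tfrac{1}{2}\, g^2(t)\; \mathbb E_{\mathbf x_t \sim p_\theta(\mathbf x_t)}\Big[ f''(r_t)\, r_t^2\, \big\| \nabla_{\mathbf x_t}\log p(\mathbf x_t) - \nabla_{\mathbf x_t}\log p_\theta(\mathbf x_t) \big\|_2^2 \Big]\, dt \;+\; D_f\big(p(\mathbf x_T) \,\|\, p_\theta(\mathbf x_T)\big), \qquad r_t = \frac{p(\mathbf x_t)}{p_\theta(\mathbf x_t)}$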

(Wang et al., 27 May 2025). This result reveals that many apparently disparate one-step objectives (KL, reverse-KL, $\chi^2$-divergence, Fisher divergence) are all special cases of a shared framework:

Diff-Instruct/DMD correspond to $\chi^2$-divergence (pure score-norm loss).
SiD/SIM correspond to reverse-KL (score-scalar projection).
Composite $f$'s yield mixed losses, sometimes with density-ratio estimation via a discriminator.

Practically, loss surrogates such as the Uni-Instruct loss (Wang et al., 27 May 2025) implement these expansions with tractable sample-based approximations, using auxiliary density-ratio networks and stop-gradient operations.
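The KL member of this family gives a case where the expansion can be verified numerically. The sketch below is an illustration under stated assumptions rather than the Uni-Instruct loss: both distributions are 1-D Gaussians, the forward SDE is pure diffusion $d\mathbf x = g\, d\mathbf w$ with constant $g$, and the identity checked is $\mathrm{KL}(p_0 \| q_0) = \int_0^T \tfrac12 g^2\, \mathbb E_{p_t}\|\nabla\log p_t - \nabla\log q_t\|^2\, dt + \mathrm{KL}(p_T \| q_T)$. Function names are hypothetical.

```python
import math

def kl_gauss(m_p, v_p, m_q, v_q):
    """KL(N(m_p, v_p) || N(m_q, v_q)) for 1-D Gaussians."""
    return 0.5 * (math.log(v_q / v_p) + (v_p + (m_p - m_q) ** 2) / v_q - 1.0)

def score_gap_gauss(m_p, v_p, m_q, v_q):
    """E_{x ~ p}[(score_p(x) - score_q(x))^2] for 1-D Gaussians."""
    return v_p * (1.0 / v_p - 1.0 / v_q) ** 2 + (m_p - m_q) ** 2 / v_q ** 2

def expanded_kl(m_p, v_p, m_q, v_q, T, g=1.0, n_grid=20000):
    """Integrate the score-gap term along the forward SDE and add the terminal KL."""
    h = T / n_grid
    total = 0.0
    for i in range(n_grid + 1):
        t = i * h
        added = g * g * t  # dx = g dw adds variance g^2 * t to both marginals
        weight = 0.5 if i in (0, n_grid) else 1.0  # trapezoid rule
        total += weight * 0.5 * g * g * score_gap_gauss(m_p, v_p + added, m_q, v_q + added)
    terminal = kl_gauss(m_p, v_p + g * g * T, m_q, v_q + g * g * T)
    return total * h + terminal

if __name__ == "__main__":
    print("static KL  :", kl_gauss(0.0, 1.0, 1.0, 2.0))
    print("expanded KL:", expanded_kl(0.0, 1.0, 1.0, 2.0, T=50.0))
```

The two printed values agree up to quadrature error, which shows how a static divergence can be traded for a time-integrated score discrepancy that is estimable from noised samples.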

4. Algorithmic Implementations & Exemplary Training Protocols

FGM-based one-step distillation pipelines typically alternate between updating the student generator and updating the auxiliary score network (and, optionally, a density-ratio or discriminator network):

The generator (student) $G_\theta$ receives gradients defined by the chosen surrogate (e.g., combined score-norm and projection terms (Zhou et al., 2024), RDMD with transport regularization (Rakitin et al., 2024), or the unified $f$-divergence surrogate (Wang et al., 27 May 2025)).
The auxiliary score network is generally trained with standard denoising score matching on the student's own outputs, often in latent (noised) space.
For conditional and translation tasks, perceptual or content-transport regularizers are added to preserve input semantics (Rakitin et al., 2024).
Learning rates, batch size, and time/noise schedules are dataset-dependent, with successful training regimes documented for CIFAR-10, FFHQ, and ImageNet-64.
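The alternating structure described above can be sketched on a 1-D toy problem. This is a hedged illustration, not any paper's training code: the teacher is an exact Gaussian score, the student is $G(z) = m + e^{\ell} z$, the auxiliary "fake" score model is linear, a single fixed noise level replaces the usual schedule, and the generator uses a Diff-Instruct/DMD-style gradient $\mathbb E[(s_{\text{fake}} - s_{\text{teacher}})\,\partial \mathbf x_t/\partial\theta]$. All names and hyperparameters are assumptions.

```python
import math
import random

SIGMA = 1.0                      # single fixed noise level (assumption)
MU_DATA, VAR_DATA = 2.0, 1.0     # toy data distribution N(2, 1)

def teacher_score(x_t):
    """Exact score of the noised data marginal; stands in for a pretrained teacher."""
    return -(x_t - MU_DATA) / (VAR_DATA + SIGMA ** 2)

def train(n_iters=2000, fake_steps=5, batch=32, lr_fake=0.02, lr_gen=0.03, seed=0):
    rng = random.Random(seed)
    m, log_s = 0.0, math.log(2.0)  # student G(z) = m + exp(log_s) * z
    w, b = 0.0, 0.0                # fake score model s_psi(x) = w * x + b
    for _ in range(n_iters):
        s = math.exp(log_s)
        # (1) Fit the fake score to the current student by denoising score matching.
        for _ in range(fake_steps):
            grad_w = grad_b = 0.0
            for _ in range(batch):
                eps = rng.gauss(0.0, 1.0)
                x_t = m + s * rng.gauss(0.0, 1.0) + SIGMA * eps
                resid = (w * x_t + b) + eps / SIGMA  # regression target is -eps / SIGMA
                grad_w += 2.0 * resid * x_t
                grad_b += 2.0 * resid
            w -= lr_fake * grad_w / batch
            b -= lr_fake * grad_b / batch
        # (2) Update the student with the score difference (fake score held fixed).
        grad_m = grad_log_s = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            x_t = m + s * z + SIGMA * rng.gauss(0.0, 1.0)
            diff = (w * x_t + b) - teacher_score(x_t)
            grad_m += diff              # d x_t / d m = 1
            grad_log_s += diff * s * z  # d x_t / d log_s = s * z
        m -= lr_gen * grad_m / batch
        log_s -= lr_gen * grad_log_s / batch
    return m, math.exp(log_s)

if __name__ == "__main__":
    mean, std = train()
    print(f"student mean={mean:.3f}, std={std:.3f} (data: mean=2, std=1)")
```

The student never sees data samples here; it is driven only by the teacher score, which mirrors the data-free setting, while the several fake-score steps per generator step keep the auxiliary model close to the moving student distribution.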

Efficient variants such as SiD are entirely data-free, operating on self-synthesized images without any real data requirement (Zhou et al., 2024). Empirically, convergence is fast: FID decreases roughly linearly on a log-log scale as a function of the number of synthesized images, i.e., it follows an approximate power law.

5. Empirical Performance and Ablation Results

One-step FGM approaches now match or exceed the sample quality of their multi-step teachers, in some cases improving on the teacher's FID, even when trained without access to real data or under severe parameter budget constraints:

Method	Dataset (NFE)	Uncond. FID	Cond. FID	Teacher FID	Note
Uni-Instruct	CIFAR-10 (1 step)	1.46	1.38	1.97	SOTA one-step; unified f-divergence
SiD	CIFAR-10 (1 step)	1.92	1.71	1.97	Data-free; fast exponential convergence
RDMD (FGM)	AFHQ Cat→Wild	6.93	–	5.40–8.87	OOD I2I; high SSIM/PSNR at low FID
DMD/MSD	ImageNet-64 (1 step)	1.20	–	1.36	Mixture-of-experts; multi-student
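To read the table as student-versus-teacher gaps, the relative FID change can be computed directly from the reported numbers. The snippet below is a small convenience sketch using only the unconditional FID and teacher FID columns above; the RDMD row is omitted because its teacher FID is given as a range, and the helper name is hypothetical.

```python
# (student unconditional FID, teacher FID), copied from the table above
ROWS = {
    "Uni-Instruct / CIFAR-10": (1.46, 1.97),
    "SiD / CIFAR-10": (1.92, 1.97),
    "DMD/MSD / ImageNet-64": (1.20, 1.36),
}

def relative_fid_gain(student_fid, teacher_fid):
    """Fraction by which the one-step student lowers FID relative to its teacher."""
    return (teacher_fid - student_fid) / teacher_fid

if __name__ == "__main__":
    for name, (student, teacher) in ROWS.items():
        print(f"{name}: {100.0 * relative_fid_gain(student, teacher):.1f}% lower FID than teacher")
```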

Ablations consistently reveal that:

Score-matching components are essential for convergence and global sample fidelity.
Proper transport or content regularization prevents semantic collapse in translation.
Data-free score-based schemes (SiD, SIM) can outperform data-dependent baselines under careful loss balancing and network initialization.

6. Practical Considerations and Extensions

FGM one-step distillation methods are now applied to:

Unpaired image-to-image translation (via regularized DMD, with explicit perceptual alignment) (Rakitin et al., 2024).
Conditional "mixture-of-experts" generators (multi-student DMD/MSD approach) for improving capacity and sample fidelity in large class-conditional or text-conditioned settings (Song et al., 2024).
Video super-resolution, using dual-stream DMD+GAN losses with advanced initialization and refinement routines (Lv et al., 23 Mar 2026).
Large-scale text-to-image (WaDi, integrating efficient parameter adaptation and direction-aware distillation) (Wang et al., 9 Mar 2026).
Direct offline mapping with offline-generated synthetic data; minimal supervision and high efficiency (GET with DEQ architectures) (Geng et al., 2023).
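The multi-student idea in the list above amounts to routing each condition to one of several one-step generators. The following is a hypothetical sketch of that routing only: the contiguous partition of class labels, the function names, and the stand-in student callables are all assumptions, and the partitioning scheme used by Song et al. (2024) may differ.

```python
def partition_classes(num_classes, num_students):
    """Split class labels 0..num_classes-1 into contiguous, near-equal blocks."""
    base, extra = divmod(num_classes, num_students)
    parts, start = [], 0
    for k in range(num_students):
        size = base + (1 if k < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts

def build_router(parts):
    """Map each class label to the index of the student responsible for it."""
    return {label: k for k, labels in enumerate(parts) for label in labels}

def sample(label, z, students, router):
    """One-step conditional sampling: exactly one student is evaluated per request."""
    return students[router[label]](z, label)

if __name__ == "__main__":
    parts = partition_classes(num_classes=10, num_students=4)
    router = build_router(parts)
    # Stand-in students that just report which expert handled the request.
    students = [lambda z, y, k=k: {"student": k, "label": y, "z": z} for k in range(4)]
    print(parts)
    print(sample(7, 0.5, students, router))
```

Because only one student runs per sample, inference cost stays at a single network evaluation while total capacity grows with the number of students.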

Typical inference times are 10–20 ms per sample (A100), a reduction of roughly 2 orders of magnitude compared to 35–1000 NFE multi-step samplers.
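As a rough sanity check on the speedup, the NFE ratio alone can be converted into orders of magnitude. This back-of-the-envelope sketch assumes that every function evaluation costs the same and that the student and teacher networks are equally expensive, which need not hold in practice; the helper name is hypothetical.

```python
import math

def nfe_speedup_orders(teacher_nfe, student_nfe=1):
    """Orders of magnitude saved if cost scales linearly with the number of evaluations."""
    return math.log10(teacher_nfe / student_nfe)

if __name__ == "__main__":
    for nfe in (35, 1000):
        print(f"{nfe} NFE -> 1 NFE: about {nfe_speedup_orders(nfe):.2f} orders of magnitude")
```

Under these assumptions the 35–1000 NFE range quoted above corresponds to about 1.5 to 3 orders of magnitude.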

7. Limitations and Future Directions

While FGM/one-step methods have demonstrated SOTA FID, IS, and perceptual metrics across benchmarks, several technical caveats persist:

Quality for high-resolution or rare-mode sampling may still benefit from multi-step refinement or multi-expert architectures.
Theoretical understanding of the generalization gap relative to underlying diffusion geometry remains limited, despite recent progress (f-divergence expansion, Koopman operator) (Wang et al., 27 May 2025, Berman et al., 19 May 2025).
Adversarial training is highly effective for sharpening; recent studies point to the role of GAN components in overcoming the local-minimum mismatches that arise with KL-based distillation (Zheng et al., 11 Jun 2025).

Recent works emphasize the power of f-divergence expansion and provide a unified theory that enables further algorithmic innovation and seamless knowledge transfer across generative frameworks (Wang et al., 27 May 2025).