One-Step Diffusion Model Training
- One-step diffusion model training is a generative approach that synthesizes high-fidelity outputs in a single network evaluation, bypassing multi-step denoising.
- It leverages diverse training paradigms—including score distillation, distributional matching, and GAN-based fine-tuning—to approximate teacher diffusion processes.
- Practical implementations rely on initialization from pretrained multi-step models and freezing strategies to maintain feature richness and stability.
A one-step diffusion model (DM) is a generative model, together with the training objective that produces it, that synthesizes high-fidelity samples in a single network evaluation, sidestepping the traditionally slow multi-step denoising process of classical diffusion models. Recent advances in theoretical unification, adversarial and distributional objectives, knowledge distillation, and architectural adaptation have made one-step DMs competitive with multi-step diffusion. This article systematically details the foundations, theoretical frameworks, prominent training paradigms, major algorithmic advances, state-of-the-art empirical results, and remaining challenges in one-step diffusion model training.
1. Theoretical Foundations: f-Divergence Expansion and Unification
The development of one-step DM training is fundamentally driven by the difficulty of minimizing a divergence between the one-step student generator's output distribution and the marginal of the teacher diffusion process. Naively, this minimization is intractable because the student provides neither density nor score access at arbitrary noise levels. Uni-Instruct (Wang et al., 27 May 2025) establishes an encompassing theory based on the diffusion expansion of integral $f$-divergences:
$$
\mathcal{D}_f\big(p_{\theta,0}\,\|\,q_0\big) \;=\; \int_0^T \frac{g(t)^2}{2}\,\mathbb{E}_{x_t \sim p_{\theta,t}}\!\Big[\, r_t(x_t)\, f''\!\big(r_t(x_t)\big)\, \big\| s_{p_\theta}(x_t,t) - s_q(x_t,t) \big\|^2 \Big]\, dt \;+\; \mathcal{D}_f\big(p_{\theta,T}\,\|\,q_T\big),
$$
where $\mathcal{D}_f(p\,\|\,q) = \mathbb{E}_q[f(p/q)]$ for convex $f$, $q_t$ and $p_{\theta,t}$ are the teacher and student marginals under the forward diffusion with coefficient $g(t)$, $r_t = p_{\theta,t}/q_t$ is the density ratio, and $s_{p_\theta}(x_t,t) = \nabla_{x_t}\log p_{\theta,t}(x_t)$, $s_q(x_t,t) = \nabla_{x_t}\log q_t(x_t)$ denote the score functions. Through a sequence of gradient-equivalence theorems, this expansion can be translated into tractable losses involving only student and teacher score networks, with suitable surrogate estimators for the density ratios and score mismatches. This generalized framework recovers nearly all published single-step diffusion training methods (KL-, JS-, Fisher-, and other $f$-divergence instances) as special cases (Wang et al., 27 May 2025).
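For intuition, choosing $f(r) = r\log r$ in the convention above makes the ratio weight $r_t f''(r_t)$ identically one, so the expansion collapses to a time-integrated score mismatch (the exact time weighting differs across papers, so this is an illustration rather than any single published parameterization):
$$
\mathcal{D}_{\mathrm{KL}}\big(p_{\theta,0}\,\|\,q_0\big) \;=\; \int_0^T \frac{g(t)^2}{2}\,\mathbb{E}_{x_t \sim p_{\theta,t}}\!\Big[ \big\| s_{p_\theta}(x_t,t) - s_q(x_t,t) \big\|^2 \Big]\, dt \;+\; \mathcal{D}_{\mathrm{KL}}\big(p_{\theta,T}\,\|\,q_T\big).
$$
This is a Diff-Instruct/DMD-style score-distillation surrogate of the kind described in Section 2; the terminal term becomes negligible as $T$ grows and both marginals approach the same prior.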
2. Training Paradigms: Distributional, Adversarial, and Score-based Schemes
A diverse set of one-step DM training pipelines has emerged; all map the multi-step teacher process to a single forward pass, but they differ in the choice of objective and supervision mechanism:
- Integral Score-based Distillation: Methods such as Diff-Instruct (Luo et al., 2023) and SIM (Wang et al., 27 May 2025) minimize time-integrated discrepancies between student and teacher scores over the diffusion trajectory via a tractable surrogate of the form
  $$
  \mathcal{L}(\theta) \;=\; \int_0^T w(t)\,\mathbb{E}_{x_t \sim p_{\theta,t}}\Big[\, d\big(s_{p_\theta}(x_t,t),\, s_q(x_t,t)\big) \Big]\, dt,
  $$
  where $x_t$ is obtained by drawing $x_0 = g_\theta(z)$ from the student generator and then diffusing it to noise level $t$, $w(t)$ is a time weighting, and $d(\cdot,\cdot)$ is a score discrepancy such as the squared error; a sketch of the resulting generator update is given after this list.
- Explicit Distributional Matching: DMD (Yin et al., 2023) directly minimizes an approximate reverse KL divergence $\mathrm{KL}(p_\theta \,\|\, p_{\mathrm{data}})$, estimating the necessary scores by training auxiliary denoisers and combining the resulting distribution-matching gradient with a regression loss that anchors the generator to large-scale precomputed teacher outputs.
- GAN-based Fine-Tuning: GDD (Zheng et al., 31 May 2024) and D2O/D2O-F (Zheng et al., 11 Jun 2025) adopt a generative adversarial perspective, matching the data (or teacher) distribution by adversarial training of the generator, with either all or the majority of model parameters frozen. This approach sidesteps hazardous sample-wise regression losses entirely.
- Score-Free Ratio-Based Methods: As demonstrated in (Zhang et al., 11 Feb 2025), the need for teacher score supervision can be bypassed by learning density-ratio estimators via binary discrimination between real (teacher) and generated (student) samples at each diffusion time, with generator updates driven by the learned ratio gradients.
- Adversarial Consistency and Self-Consistency: ACT (Kong et al., 2023) incorporates discriminators directly in consistency training loops to minimize JS divergence at each time, while shortcut and rectified flow models (Frans et al., 16 Oct 2024, Zhu et al., 17 Jul 2024) devise self-consistency or flow-matching objectives that allow explicit, controllable step-count reduction.
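To make the score-based distillation paradigm concrete, the following is a minimal PyTorch-style sketch of one alternating update in the spirit of Diff-Instruct/DMD, written against the squared-error discrepancy above. It is a sketch under stated assumptions, not the implementation of any cited method: `generator`, `teacher_score`, `student_score`, and the forward-process helper `alpha_sigma` are illustrative placeholders, and data are treated as flat vectors for simplicity.

```python
import torch

def distillation_step(generator, teacher_score, student_score, opt_g, opt_s,
                      alpha_sigma, batch_size=64, z_dim=128, T=1.0, device="cuda"):
    """One alternating update of a score-distillation recipe (illustrative sketch).

    generator(z)          -- one-step student g_theta, maps noise to samples (batch, dim)
    teacher_score(x_t, t) -- frozen pretrained score net, approximates grad log q_t
    student_score(x_t, t) -- auxiliary "fake" score net tracking the generator's outputs
    alpha_sigma(t)        -- assumed helper returning forward coefficients (alpha_t, sigma_t)
    """
    # --- 1) Fit the auxiliary student score to fresh generator samples (denoising score matching)
    z = torch.randn(batch_size, z_dim, device=device)
    with torch.no_grad():
        x0 = generator(z)                                  # no generator gradient needed here
    t = torch.rand(batch_size, device=device) * (T - 1e-3) + 1e-3
    alpha, sigma = alpha_sigma(t)
    eps = torch.randn_like(x0)
    xt = alpha[:, None] * x0 + sigma[:, None] * eps        # diffuse to noise level t
    target = -eps / sigma[:, None]                         # conditional score of the forward kernel
    loss_s = ((student_score(xt, t) - target) ** 2).mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()

    # --- 2) Update the generator with the (detached) teacher/student score mismatch
    z = torch.randn(batch_size, z_dim, device=device)
    x0 = generator(z)                                      # keep the graph this time
    t = torch.rand(batch_size, device=device) * (T - 1e-3) + 1e-3
    alpha, sigma = alpha_sigma(t)
    xt = alpha[:, None] * x0 + sigma[:, None] * torch.randn_like(x0)
    with torch.no_grad():                                  # stop-gradient through both score nets
        grad_dir = student_score(xt, t) - teacher_score(xt, t)
    # Surrogate whose theta-gradient is E[(s_student - s_teacher)^T d x_t / d theta],
    # pushing generated samples toward the teacher marginal.
    loss_g = (grad_dir * xt).sum(dim=1).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_s.item(), loss_g.item()
```

DMD additionally anchors the generator with a paired regression loss against precomputed teacher outputs, and GAN-based recipes replace the second stage with a discriminator update; the alternating structure and the stop-gradient on the score mismatch are common to most of the distributional schemes above.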
3. Practical Training Algorithms and Architectures
Across these paradigms, effective one-step DM training exhibits consistent high-level features:
- Initialization: The student generator is typically initialized from a pretrained multi-step DM (evaluated at an intermediate or terminal diffusion timestep), which is essential for preserving rich, multi-scale learned features and preventing mode collapse (Zhang et al., 11 Feb 2025, Zheng et al., 31 May 2024).
- Score Networks and Discriminators: Auxiliary networks (student scores, "fake" denoisers, or GAN discriminators) are often required for density ratio estimation, gradient computation, or stabilization. Stop-gradient operations, EMA updates, and appropriate regularization are critical for stable convergence (Wang et al., 27 May 2025, Yin et al., 2023).
- Freezing and Parameter Sharing: Overparameterization is addressed by freezing large portions of the pretrained weights (typically the convolutional layers) and retraining only normalization, skip, or attention-projection layers, which unlocks the model's capacity for "innate" single-step synthesis (Zheng et al., 31 May 2024, Zheng et al., 11 Jun 2025); see the sketch after this list.
- Data and Computation Regimes: Leading recipes leverage either large, diverse teacher-generated trajectory pairs, or efficiently sampled real data; compute budgets range from tens of GPU-days for high-resolution domains (Lin et al., 14 Jan 2025) to minutes for CIFAR-10 (Geng et al., 2023). GAN-based and distributional approaches, in particular, minimize data redundancy and training expense (Zheng et al., 31 May 2024, Zheng et al., 11 Jun 2025).
- Conditionality and Guidance: Advanced text/image/video conditional synthesis requires specialized conditioning, e.g., classifier-free guidance with randomly sampled guidance scales for improved stability, or NASA negative-prompt attention modules for enhanced controllability (Nguyen et al., 3 Dec 2024).
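The snippet below illustrates the partial-freezing idea referenced in the freezing bullet above: convolutional weights of a pretrained student stay frozen while normalization, skip, and attention-projection parameters remain trainable. The parameter-name patterns are assumptions about a typical diffusion U-Net, not the module names used by GDD or D2O-F.

```python
import torch.nn as nn

# Hypothetical name patterns for the parameter subsets kept trainable; real
# checkpoints (e.g. an EDM or Stable Diffusion U-Net) use their own module names.
TRAINABLE_PATTERNS = ("norm", "skip", "attn", "to_q", "to_k", "to_v", "to_out")

def apply_partial_freeze(student: nn.Module) -> list:
    """Freeze most pretrained weights; keep norm/skip/attention-projection layers trainable."""
    trainable = []
    n_total = n_trainable = 0
    for name, param in student.named_parameters():
        param.requires_grad = any(pat in name for pat in TRAINABLE_PATTERNS)
        n_total += param.numel()
        if param.requires_grad:
            n_trainable += param.numel()
            trainable.append(param)
    print(f"trainable: {n_trainable:,} / {n_total:,} params ({n_trainable / n_total:.1%})")
    return trainable

# Usage: only the unfrozen subset is handed to the optimizer.
# optimizer = torch.optim.AdamW(apply_partial_freeze(student), lr=1e-5)
```

Handing only the unfrozen subset to the optimizer keeps memory and compute low and, per the cited works, is often sufficient to unlock the pretrained features' "innate" single-step synthesis capacity.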
4. Empirical Performance across Domains
Recent one-step DMs now rival or surpass multi-step diffusion generators on standard image, video, and structured data benchmarks:
| Method | Domain | FID (single-step) | Teacher FID (multi-step) | Notes |
|---|---|---|---|---|
| Uni-Instruct (JKL) (Wang et al., 27 May 2025) | CIFAR-10 32x32 | 1.46 | 1.97 | SOTA, unified theory |
| Uni-Instruct (FKL) (Wang et al., 27 May 2025) | ImageNet 64x64 | 1.02 | 2.35 | Exceeds 79-step teacher |
| GDD-I (Zheng et al., 31 May 2024) | CIFAR-10 32x32 | 1.54 | 1.98 | Pure GAN loss/freeze |
| SwiftBrush-v2 (Dao et al., 26 Aug 2024) | T2I, COCO | 8.14 | 9.64 (SD2.1) | One-step > multi-step |
| DMD (Yin et al., 2023) | ImageNet 64x64 | 2.62 | 2.32 | Outperforms few-step |
| DOVE (Chen et al., 22 May 2025) | VSR, Real videos | — | — | 28× speedup w/ parity |
| SlimFlow (Zhu et al., 17 Jul 2024) | CIFAR-10 32x32 | 5.02 (15.7M params) | — | Compression + 1-step |
On harder text-to-image and text-to-3D tasks, methods such as SNOOPI (Nguyen et al., 3 Dec 2024) and Uni-Instruct (Wang et al., 27 May 2025) report high human-preference scores and diversity/precision metrics, setting new records for one-step models. In speech enhancement (Lay et al., 2023), two-stage training schemes bring one-step prediction quality on par with, or beyond, multi-step baselines.
5. Domain-Specific Extensions: Video, Motion, and Robotics
One-step DM pipelines have been extended to temporally structured and robotic domains:
- Video Super-Resolution: DOVE (Chen et al., 22 May 2025) applies a two-stage latent→pixel regression scheme to adapt a pretrained T2V diffusion backbone to efficient VSR without specialized modules, and uses a tailored high-quality video dataset (HQ-VSR) for fine-tuning.
- Video Generation: Seaweed-APT (Lin et al., 14 Jan 2025) achieves real-time 1280x720, 24fps one-step video synthesis by adversarial post-training following diffusion distillation, with critical architectural and regularization adaptations to stabilize training at high capacity.
- Human Motion Prediction: A two-stage, knowledge-distillation and Bayesian optimization pipeline enables MLP-only, real-time one-step DMs for 3D pose chains, matching multi-step accuracy at >50 Hz throughput (Tian et al., 19 Sep 2024).
- Robot Control: Flow-matching shortcut models (Frans et al., 16 Oct 2024) capture high-level transition policies capable of one-step action diffusion synthesis in robotics tasks.
6. Limitations and Open Challenges
Despite remarkable progress, several technical challenges remain:
- Ratio/Score Estimation: Highly accurate score or density ratio estimation across the diffusion trajectory is computationally burdensome and remains unstable for high-resolution or conditional domains (Wang et al., 27 May 2025, Yin et al., 2023).
- Capacity and Scalability: Compressing extensive multi-step denoising dynamics into a shallow one-step mapping induces sharp bottlenecks in sample quality, especially for structured signals where long-range dependencies are crucial (Lin et al., 14 Jan 2025, Zhu et al., 17 Jul 2024).
- Mode Diversity and Collapse: Proper initialization and architectural design (feature richness, multi-task blocks, freezing) are required to avoid severe mode-collapse and maintain generative diversity (Zhang et al., 11 Feb 2025).
- Adaptation to New Modalities: While most methods extend in principle to T2I, T2V, and structured prediction, cross-modal adaptation often demands careful redesign of conditioning, regularization, and guidance mechanisms (Nguyen et al., 3 Dec 2024, Wang et al., 27 May 2025).
- Interpretability: Frequency-domain analyses (Zheng et al., 11 Jun 2025) suggest diffusion pre-training endows models with frequency specialization, but how best to exploit or extend such representations for more diverse or compositional synthesis tasks is poorly understood.
7. Outlook and Future Directions
The unified $f$-divergence expansion framework (Wang et al., 27 May 2025) provides both a theoretical foundation and a taxonomy for all known one-step DM training strategies. It enables principled design and benchmarking of new approaches, such as:
- Automated divergence scheduling or adaptive $f$-selection
- Ratio-free or self-supervised architectures for high-dimensional, conditional modalities
- Efficient, on-the-fly ratio/score surrogates to further accelerate and stabilize training for large domains
- Deeper analysis of architectural bottlenecks (block-wise or frequency domain) to improve capacity and interpretability
As a result, one-step DM training is now a mature, theoretically grounded field with robust, efficient recipes for diverse generative modeling domains. Continued advances are anticipated in scalability, multimodality, and interpretability, driven by both theoretical insight and empirical innovation.