Fast Diffusion Model (FDM)

Updated 3 June 2026

Fast Diffusion Model is a framework that unifies rapid physical diffusion in PDEs with algorithmic acceleration in generative models.
It utilizes momentum-based updates, multistage architectures, and knowledge distillation to significantly reduce computational cost and sampling iterations.
Its applications span mathematical physics, image synthesis, robotic control, and reaction–diffusion systems, highlighting both theoretical insights and practical impact.

A Fast Diffusion Model (FDM) refers to a class of models or methods, in both partial differential equations (PDEs) and modern machine-learning generative models, that embody or exploit rapid diffusive dynamics, either in the physical sense (e.g., in mathematical physics and interacting particle systems) or in the algorithmic sense of accelerated training or sampling in generative modeling. The term is used in two distinct contexts: (1) the fast diffusion equation (FDE) and its microscopic models in mathematical physics, and (2) algorithmically accelerated generative diffusion models in deep learning. This article presents both regimes, covering their physical, mathematical, and algorithmic foundations as well as current research and application directions.

1. Fast Diffusion in Mathematical Physics and Hydrodynamic Limits

In the context of interacting particle systems and nonlinear PDEs, the Fast Diffusion Model addresses continuum limits in which diffusion occurs anomalously rapidly, often with a diffusion coefficient that is a nonlinear function of local density. The canonical family is governed by the nonlinear equation

$\partial_t \rho = \nabla \cdot \bigl( D(\rho) \nabla \rho \bigr)$

where the diffusion coefficient takes the form

$D(\rho) = m \rho^{m-1}, \qquad m \in (0,2]$

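For $m \in (0,1)$ this yields the fast diffusion equation, while $m \in (1,2]$ gives the porous medium equation. The critical case $m=1$ reduces to classical linear diffusion (the heat equation). In the fast diffusion regime ($m<1$), $D(\rho)$ diverges as $\rho \to 0$, so mass spreads very quickly where the density is low. Solutions feature rapid smoothing and infinite speed of propagation, can exhibit finite-time extinction in some settings (for example, on bounded domains with absorbing boundaries), and have weaker regularity than in the slow (porous medium) case.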
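The equation above can be explored numerically using the identity $\nabla \cdot (m \rho^{m-1} \nabla \rho) = \Delta(\rho^m)$. The following is a minimal sketch, assuming a 1D periodic grid, a strictly positive initial profile, and a simple explicit time-stepping scheme; the grid size, time horizon, and initial bump are illustrative choices, not taken from the cited work.

```python
import numpy as np

def step(rho, m, dx, dt):
    """One explicit step of d_t rho = d_xx (rho**m) on a periodic grid."""
    phi = rho ** m
    lap = (np.roll(phi, -1) - 2.0 * phi + np.roll(phi, 1)) / dx**2
    return rho + dt * lap

def evolve(rho0, m, dx, t_end):
    """Evolve with a fixed step chosen from the largest initial diffusivity.

    The scheme is monotone when dt <= dx^2 / (2 * max D), so the density
    stays between its initial minimum and maximum and the bound on
    D(rho) = m * rho^(m-1) computed from rho0 remains valid.
    """
    d_max = np.max(m * rho0 ** (m - 1.0))
    n_steps = int(np.ceil(t_end / (0.4 * dx**2 / d_max)))
    dt = t_end / n_steps
    rho = rho0.copy()
    for _ in range(n_steps):
        rho = step(rho, m, dx, dt)
    return rho

n = 128
dx = 1.0 / n
x = np.linspace(0.0, 1.0, n, endpoint=False)
# Bump on a positive background (positivity keeps rho**(m-1) finite for m < 1).
rho0 = 0.2 + 0.6 * np.exp(-((x - 0.5) ** 2) / 0.005)

rho_fast = evolve(rho0, m=0.5, dx=dx, t_end=0.01)   # fast diffusion
rho_slow = evolve(rho0, m=2.0, dx=dx, t_end=0.01)   # porous medium

print("mass before/after (m=0.5):", rho0.sum() * dx, rho_fast.sum() * dx)
print("D at background density 0.2: m=0.5 ->", 0.5 * 0.2 ** -0.5, " m=2 ->", 2 * 0.2)
```

The printed diffusivities show the defining feature of the fast regime: at low density the coefficient for $m=0.5$ exceeds the one for $m=2$, while the ordering reverses at high density.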

The underlying microscopic construction (Gonçalves et al., 2023) is a nearest-neighbor exclusion process on the $N$-site torus $\mathbb{T}_N$, in which particle jump rates are modulated by alternating sums of kinetic constraints with "flipped" occupancies, generated via a binomial expansion. The model interpolates continuously from the symmetric simple exclusion process (SSEP, $m=1$) to fast diffusion ($m\to 0$) by adjusting the binomial coefficients and truncating the local kinetic constraints. The entropy method, combined with new space–time averaging replacement lemmas, yields the hydrodynamic limit: the empirical density converges weakly to the solution of the fast diffusion PDE. In the fast case, proving regularity, energy dissipation, and uniqueness involves new estimates, most notably a tight bound on the Dirichlet form and a Taylor expansion argument for the nonlinear term that controls solution properties in the singular regime $m<1$, where $D(\rho)$ diverges as $\rho \to 0$.
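The role of the binomial expansion can be seen at the level of the macroscopic coefficient. By the generalized binomial theorem, $\rho^{m-1} = \sum_{k \ge 0} \binom{m-1}{k} (-1)^k (1-\rho)^k$ for $\rho \in (0,1]$. The sketch below only checks this series numerically; it does not reproduce the jump rates of the cited particle model, and the function names are hypothetical.

```python
import numpy as np

def gen_binom(a, k_max):
    """Generalized binomial coefficients C(a, k) for k = 0..k_max."""
    coeffs = np.empty(k_max + 1)
    coeffs[0] = 1.0
    for k in range(1, k_max + 1):
        coeffs[k] = coeffs[k - 1] * (a - k + 1) / k
    return coeffs

def truncated_power(rho, m, k_max):
    """Partial sum of rho**(m-1) = sum_k C(m-1, k) (-1)^k (1 - rho)^k."""
    k = np.arange(k_max + 1)
    terms = gen_binom(m - 1.0, k_max) * (-1.0) ** k * (1.0 - rho) ** k
    return terms.sum()

rho = 0.5
for m in (0.5, 1.0, 2.0):
    print(m, truncated_power(rho, m, k_max=60), rho ** (m - 1.0))
```

For integer $m$ the series terminates after finitely many terms ($m=1$ gives the constant rate of SSEP, $m=2$ gives $\rho$), whereas for $m \in (0,1)$ infinitely many terms contribute, which is consistent with the need to truncate in the fast regime.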

2. Momentum-Accelerated Diffusion Models in Generative Learning

The term Fast Diffusion Model (FDM) in deep generative modeling covers algorithmic modifications that accelerate the convergence of diffusion models by introducing explicit momentum into the forward process, which changes its transition kernel (Wu et al., 2023). By analogy with heavy-ball (momentum) stochastic gradient descent (SGD), FDM augments the forward diffusion update with a memory term that carries over the previous displacement. This recasts forward-time DDPM as a damped oscillation system, with the critical damping choice yielding the fastest non-oscillatory convergence.

The resulting closed-form forward kernel mean replaces the standard exponential decay, leading to more direct convergence toward the target distribution and, by extension, higher sample efficiency. Reported empirical results show training cost roughly halved and about a threefold reduction in the sampling steps needed to reach the same FID as standard baselines, across diffusion formulations (VP, VE, EDM) and datasets (CIFAR-10, FFHQ, AFHQv2). The approach is model-agnostic: the closed-form kernel can be inserted into any score-based or denoising training pipeline without additional network parameters or memory overhead.
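The critical-damping idea can be illustrated with the classical heavy-ball iteration $x_{k+1} = x_k - \alpha \nabla f(x_k) + \beta (x_k - x_{k-1})$ on a quadratic $f(x) = \lambda x^2 / 2$. This is a sketch of the optimization analogy only, not the FDM forward kernel of Wu et al.; the step size `alpha`, curvature `lam`, and tolerance are illustrative assumptions.

```python
import numpy as np

def steps_to_tolerance(alpha, beta, lam=1.0, x0=1.0, tol=1e-3, max_iter=10_000):
    """Run heavy-ball on f(x) = lam * x^2 / 2 and count steps until |x| < tol.

    beta = 0 recovers plain gradient descent.
    """
    x_prev, x = x0, x0  # start at rest (zero initial displacement)
    for k in range(1, max_iter + 1):
        x_prev, x = x, x - alpha * lam * x + beta * (x - x_prev)
        if abs(x) < tol:
            return k
    return max_iter

alpha, lam = 0.01, 1.0
# Critical damping: the characteristic polynomial
# r^2 - (1 - alpha*lam + beta) r + beta has a double real root.
beta_crit = (1.0 - np.sqrt(alpha * lam)) ** 2

gd_steps = steps_to_tolerance(alpha, beta=0.0, lam=lam)
hb_steps = steps_to_tolerance(alpha, beta=beta_crit, lam=lam)
print("gradient descent steps:", gd_steps)
print("critically damped heavy-ball steps:", hb_steps)
```

With these values the per-step contraction improves from $1 - \alpha\lambda = 0.99$ to $1 - \sqrt{\alpha\lambda} = 0.9$, and the critically damped iterate approaches zero without changing sign.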

3. Multistage and Structure-Preserving Fast Diffusion Architectures

Variants of fast diffusion in deep learning employ architectural strategies for enhanced generative efficiency. The "f-DM" model (Gu et al., 2022) formalizes a progressive multistage diffusion in which signal transformations (e.g., downsampling, blurring, learned encoders) structure the forward process. At each stage, diffusion is performed not in the original sample space but on a transformed or abstracted latent (possibly hierarchical or VAE-encoded). Noise schedules are adjusted at resolution boundaries via explicit rescaling, preserving the signal-to-noise ratio and controlling variance upon upsampling. This staged design, which covers both hand-designed (downsampling) and learned (VQ-VAE or VQGAN) transformations, allows f-DM to achieve a roughly 2–3$\times$ speed-up relative to single-stage DDPMs, with sample quality (FID, precision/recall) at or above baseline.
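A small numerical check shows why some rescaling is needed at a resolution boundary: averaging independent Gaussian noise over 2×2 blocks halves its standard deviation, so the signal-to-noise ratio jumps unless the schedule compensates. The sketch below assumes average pooling and a constant test image; it does not reproduce the specific rescaling rule of f-DM.

```python
import numpy as np

def avg_pool2x2(img):
    """Average-pool a 2D array with even side lengths over 2x2 blocks."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(0)
sigma = 0.8
signal = np.ones((256, 256))  # constant image: pooling leaves it unchanged
noisy = signal + sigma * rng.standard_normal(signal.shape)

pooled = avg_pool2x2(noisy)
noise_std_after = (pooled - avg_pool2x2(signal)).std()
snr_gain = (sigma / noise_std_after) ** 2  # signal power is unchanged here

print("noise std before:", sigma, "after pooling:", noise_std_after)
print("SNR gain from pooling:", snr_gain)
```

Under these assumptions the noise level in the lower-resolution stage would have to be raised by about a factor of two (or the time index shifted accordingly) to keep the signal-to-noise ratio continuous across the boundary.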

4. Knowledge Distillation and Few-Step Inference

Recent fast diffusion models achieve major acceleration through progressive knowledge distillation or teacher–student paradigms, effective at collapsing hundreds of sampling iterations into a handful or even a single step.

In "ProDiff" (Huang et al., 2022), applied to text-to-speech, a generator-based denoiser is trained to predict the clean target (mel-spectrogram) rather than the gradient, which is empirically more robust under aggressive (e.g., 2-step) sampling. By distilling a teacher that uses more sampling steps into a student that uses fewer, and by using teacher outputs as supervision rather than ground-truth audio, ProDiff eliminates high-variance targets and achieves inference 24$\times$ faster than real time, with competitive quality (MOS, MCD, STOI, PESQ) and diversity (NDB/JS).
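The teacher–student step reduction can be sketched with the generic progressive-distillation construction: the student's clean-signal target is chosen so that one deterministic (DDIM-style) student step lands where two teacher steps land. This is an illustration under stated assumptions (a cosine variance-preserving schedule and a toy closed-form teacher for standard normal data), not ProDiff's actual training recipe.

```python
import numpy as np

def alpha_sigma(t):
    """Variance-preserving cosine schedule (illustrative choice), t in (0, 1)."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def ddim_step(z_t, x0_hat, t, s):
    """Deterministic update from time t to s < t given a clean-signal estimate."""
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    eps_hat = (z_t - a_t * x0_hat) / s_t
    return a_s * x0_hat + s_s * eps_hat

def teacher_x0(z_t, t):
    """Toy teacher: exact E[x0 | z_t] when x0 ~ N(0, 1) and alpha^2 + sigma^2 = 1."""
    a_t, _ = alpha_sigma(t)
    return a_t * z_t

def distillation_target(z_t, t, s):
    """Clean-signal target that makes ONE student step t -> s match TWO teacher steps."""
    mid = 0.5 * (t + s)
    z_mid = ddim_step(z_t, teacher_x0(z_t, t), t, mid)
    z_s = ddim_step(z_mid, teacher_x0(z_mid, mid), mid, s)
    a_t, s_t = alpha_sigma(t)
    a_s, s_s = alpha_sigma(s)
    ratio = s_s / s_t
    target = (z_s - ratio * z_t) / (a_s - ratio * a_t)
    return target, z_s

rng = np.random.default_rng(0)
z_t = rng.standard_normal(5)
target, z_s_teacher = distillation_target(z_t, t=0.8, s=0.4)
z_s_student = ddim_step(z_t, target, t=0.8, s=0.4)
print(np.max(np.abs(z_s_student - z_s_teacher)))
```

Because the target is computed from teacher outputs rather than from ground-truth data, it is a deterministic function of `z_t`, which is the sense in which distillation removes high-variance supervision.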

For long-sequence discrete data, FS-DFM (Monsefi et al., 24 Sep 2025) introduces "Few-Step Discrete Flow-Matching," in which the model is explicitly conditioned on the number of allowed sampling steps, with Runge–Kutta shortcut teachers enforcing self-consistency over large jumps. By constructing a cumulative-scalar update, FS-DFM attains GPT-2-level perplexity on 1,024-token sequences in only 8 steps, a throughput gain on the order of 128$\times$, without sacrificing entropy or masked-reconstruction accuracy.
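Why a higher-order teacher helps over large jumps can be seen on a toy probability flow. The sketch below integrates the master equation $\dot{p} = Qp$ of a fixed 3-state Markov chain with one large Euler step and one large classical Runge–Kutta (RK4) step and compares both to the exact solution. The rate matrix `Q` and step size are hypothetical; FS-DFM uses a learned, step-conditioned model rather than a fixed generator.

```python
import numpy as np

# Symmetric 3-state rate matrix; columns sum to zero, so total probability is conserved.
Q = np.array([[-2.0, 1.0, 1.0],
              [1.0, -2.0, 1.0],
              [1.0, 1.0, -2.0]])

def euler_step(p, h):
    return p + h * (Q @ p)

def rk4_step(p, h):
    k1 = Q @ p
    k2 = Q @ (p + 0.5 * h * k1)
    k3 = Q @ (p + 0.5 * h * k2)
    k4 = Q @ (p + h * k3)
    return p + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def exact_step(p, h):
    # Matrix exponential via eigendecomposition (valid because Q is symmetric).
    lam, V = np.linalg.eigh(Q)
    return V @ (np.exp(lam * h) * (V.T @ p))

p0 = np.array([1.0, 0.0, 0.0])
h = 0.5  # one deliberately large jump

err_euler = np.abs(euler_step(p0, h) - exact_step(p0, h)).max()
err_rk4 = np.abs(rk4_step(p0, h) - exact_step(p0, h)).max()
print("Euler error:", err_euler, "RK4 error:", err_rk4)
```

A student trained to match such a higher-order target over a large jump inherits a much smaller discretization error than one trained against a single Euler step of the same size.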

Robot control models (FODMP; Shi et al., 25 Mar 2026) further distill multi-step diffusion into a one-step consistency decoder for trajectory generation in dynamic settings, outperforming previous chunking and trajectory-level diffusion in both speed (10$\times$ over MPD) and real-robot success rates.

5. Feature Caching, Speculation, and Inference Acceleration

Algorithmic FDMs also exploit architectural or information-theoretic speedups at inference. SpecDiff (Pan et al., 17 Sep 2025), for instance, integrates self-speculative mini-iterations into transformer-based diffusion (DiT), dynamically scoring tokens to decide between recomputation, caching, and approximation. The core mechanism extracts both historical and future (speculative) importance for each token and allocates computation adaptively to maximize quality retention at a given compute budget. This triage, combined with speculative attention and multi-level token splitting (full compute, reuse, Taylor approximation), delivers 2.4–3.2$\times$ end-to-end speedups on large-scale diffusion models (Stable Diffusion 3/3.5, FLUX), with drops in FID, CLIP, and VQA metrics limited to roughly 5%.
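The three-way token split can be sketched as follows. This is a hypothetical illustration of the general pattern (rank tokens by an importance score, recompute the top group, extrapolate the middle group to first order from two cached steps, reuse the rest); the scoring rule, group fractions, and function names are assumptions and do not reproduce SpecDiff's implementation.

```python
import numpy as np

def triage_tokens(scores, full_frac=0.25, reuse_frac=0.25):
    """Split token indices into (full, taylor, reuse) groups by descending score."""
    order = np.argsort(-scores)
    n = len(scores)
    n_full = int(round(full_frac * n))
    n_reuse = int(round(reuse_frac * n))
    full = order[:n_full]
    taylor = order[n_full:n - n_reuse]
    reuse = order[n - n_reuse:]
    return full, taylor, reuse

def update_features(compute_fn, cache_prev, cache_curr, scores):
    """Build next-step features with a per-token compute budget."""
    full, taylor, reuse = triage_tokens(scores)
    new = np.empty_like(cache_curr)
    new[full] = compute_fn(full)  # expensive path, only for important tokens
    # First-order (finite-difference) extrapolation from the last two cached steps.
    new[taylor] = cache_curr[taylor] + (cache_curr[taylor] - cache_prev[taylor])
    new[reuse] = cache_curr[reuse]  # cheapest path: reuse the cached value
    return new

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 4))  # 8 tokens, 4 channels
scores = rng.random(8)              # hypothetical importance scores
# Toy features that grow linearly with the step index: step k gives k * base.
new = update_features(lambda idx: 3.0 * base[idx], 1.0 * base, 2.0 * base, scores)
```

In this toy setup the features are exactly linear in the step index, so the extrapolated group matches full recomputation, while the reused group lags one step behind; real features are only approximately smooth, which is why the importance score decides which tokens can tolerate approximation.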

Analytical diffusion frameworks (GoldDiff; Shang et al., 18 Feb 2026) speed up empirical-Bayes denoising by dynamically restricting retrieval to a "Golden Subset" of neighbors with non-negligible posterior mass, adjusting the subset size in response to the signal-to-noise ratio. This reduces the amortized per-step cost from one that scales with the full dataset to one that scales with the much smaller retrieved subset, scaling analytical diffusion to datasets such as ImageNet-1K with a 71$\times$ speedup and no qualitative performance loss.
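The subset idea can be sketched with the standard empirical-Bayes denoiser: for $x_t = \alpha x_0 + \sigma \varepsilon$ with $x_0$ drawn uniformly from a dataset, the posterior mean is a softmax-weighted average of the data points, and at moderate noise almost all weight sits on a few near neighbors. The code below is a minimal sketch assuming a fixed top-`k` cutoff and brute-force distances; GoldDiff's adaptive subset sizing and retrieval structure are not reproduced.

```python
import numpy as np

def posterior_mean(x_t, data, alpha, sigma, k=None):
    """Empirical-Bayes denoiser E[x0 | x_t] over the rows of `data`.

    If k is given, only the k rows closest to x_t (after scaling by alpha)
    contribute to the weighted average.
    """
    d2 = np.sum((x_t - alpha * data) ** 2, axis=1)
    if k is not None:
        idx = np.argpartition(d2, k - 1)[:k]  # indices of the k smallest distances
        data, d2 = data[idx], d2[idx]
    logw = -d2 / (2.0 * sigma**2)
    w = np.exp(logw - logw.max())  # subtract the max for numerical stability
    w /= w.sum()
    return w @ data

rng = np.random.default_rng(0)
data = rng.standard_normal((5000, 16))  # synthetic stand-in for a dataset
alpha, sigma = 0.9, 0.3
x_t = alpha * data[0] + sigma * rng.standard_normal(16)

full = posterior_mean(x_t, data, alpha, sigma)
subset = posterior_mean(x_t, data, alpha, sigma, k=50)
gap = np.abs(full - subset).max()
print("max difference between full and top-50 denoiser:", gap)
```

Here the distance computation still touches every row; the practical gain comes from pairing the subset with an index that finds near neighbors without a full scan, and from enlarging the subset as the noise level grows and the posterior spreads out.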

6. Fast Diffusion in Complex Systems and Reaction–Diffusion Fronts

In spatial ecology and reaction–diffusion modeling, "fast diffusion" describes enhanced propagation speeds in multi-component environments, e.g., a reaction–diffusion field coupled to a 1D "road" of high diffusivity (Dietrich et al., 2016). The system couples a diffusion equation with diffusivity $D$ on the road (with $D$ large) to a reaction–diffusion equation in the field via Robin-type boundary exchange. The model demonstrates that, under both Fisher–KPP and ignition-type nonlinearities, the asymptotic spreading velocity grows like $\sqrt{D}$, independent of the reaction form, with the prefactor calculable from eigenvalue asymptotics. Further, there exists a two-speed regime in the ignition case: an $O(1)$ speed pre-ignition followed by a rapid $\sqrt{D}$-law front post-ignition. This result provides theoretical underpinning for accelerated invasion phenomena in structured spatial domains, a direct physical analog to rapid mixing and mass transport in mathematical modeling.
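For concreteness, a road-field coupling of this kind can be written in the half-plane form associated with Berestycki, Roquejoffre and Rossi. The notation below is an assumption for illustration: $u(x,t)$ is the density on the road $\{y=0\}$, $v(x,y,t)$ the density in the field $\{y>0\}$, $d$ the field diffusivity, $\mu,\nu>0$ the exchange rates, and $f$ the reaction term; the cited work may use a different geometry (such as a strip) or sign convention.

$\partial_t u - D\,\partial_{xx} u = \nu\, v(x,0,t) - \mu\, u, \qquad x \in \mathbb{R}$

$\partial_t v - d\,\Delta v = f(v), \qquad x \in \mathbb{R},\ y > 0$

$-d\,\partial_y v(x,0,t) = \mu\, u(x,t) - \nu\, v(x,0,t)$

The last line is the Robin-type exchange condition: the flux entering the field through the road equals what the road loses, so that in the absence of reaction ($f \equiv 0$) the total mass $\int u\,dx + \iint v\,dx\,dy$ is conserved.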

7. Applications, Limitations, and Future Research

Fast Diffusion Models exhibit broad application: image synthesis, high-fidelity speech generation, real-time robotic planning, high-energy physics simulations, algorithmic biology, and beyond. Each approach presents specific merits and limitations (e.g., requirements of a suitable kernel, data-dependence of analytical denoisers, or the need for specialized training/dataset architectures). Possible extensions include more general or learned schedule optimization, scaling knowledge distillation to multimodal settings, or further translation of physical-model-based FDEs into robust regularizers or priors in data-driven pipelines.

Key advances—momentum-based kernel design, multistage transformation, teacher–student compression, subset retrieval, feature speculation—continue to push inference and training cost to practical, real-time regimes while preserving or advancing generative sample quality (Wu et al., 2023, Gu et al., 2022, Huang et al., 2022, Monsefi et al., 24 Sep 2025, Shi et al., 25 Mar 2026, Pan et al., 17 Sep 2025, Shang et al., 18 Feb 2026, Dietrich et al., 2016, Gonçalves et al., 2023).

References: