
Fast-DDPM: Accelerated Diffusion Models

Updated 7 December 2025
  • Fast-DDPM is an architecture for denoising diffusion models that uses a matched set of 10 discrete timesteps to significantly reduce computational overhead in high-dimensional medical imaging.
  • The method aligns the training and inference processes, reducing training time to roughly 20% and sampling time to 1% of a standard 1,000-step DDPM.
  • Empirical results in volumetric super-resolution, denoising, and translation show superior PSNR/SSIM performance and drastic speedups over traditional GAN- and CNN-based architectures.

Fast-DDPM is an architecture and sampling protocol for Denoising Diffusion Probabilistic Models (DDPMs) engineered to overcome the computational bottlenecks of standard diffusion-based generative methods, especially for high-dimensional medical image-to-image generation. Fast-DDPM departs from the canonical 1,000-step training and sampling paradigm by aligning both training and inference to a drastically reduced, matched 10-step schedule. This approach achieves state-of-the-art generation fidelity (as measured by metrics such as PSNR/SSIM) while cutting training time to approximately 20% and sampling time to 1% of the standard DDPM baseline, as validated across volumetric super-resolution, denoising, and translation tasks in 3D/4D medical imaging (Jiang et al., 23 May 2024).

1. Motivation and Core Principles

The principal challenge addressed by Fast-DDPM is the prohibitive cost of applying standard DDPMs to medical volumes, which are often three- or four-dimensional and require days to weeks for training and minutes to hours for sampling a single image. These constraints are largely a result of the 1,000-step Markov diffusion/sampling chains, which are not coordinated between training and inference: standard approaches train across all 1,000 noise scales but often use only a small subset during sampling, leading to severe computational waste and suboptimal step utilization (Jiang et al., 23 May 2024).

Fast-DDPM solves this by:

  • Restricting both training and sampling to a common set of 10 discrete timesteps.
  • Designing two noise schedulers (uniform and non-uniform over noise level) for versatility.
  • Directly aligning the denoiser’s capacity with the actual inference trajectory, eliminating the training/sampling mismatch prevalent in DDIM/PLMS/PNDM/DPM-Solver approaches that rely on post-training schedule subsampling.

2. Architecture and Conditioning Schemes

The denoiser $\epsilon_\theta$ in Fast-DDPM retains the established U-Net backbone, adopting modality-appropriate dimensionality:

  • In 2D, each block uses Conv2D + GroupNorm + ReLU.
  • For 3D or 4D (e.g. temporal) data, these become Conv3D/Conv4D, with GroupNorm and ReLU or Swish activations.
  • Down-/up-sampling blocks concatenate conditional images or feature maps (such as additional slices or multi-contrast channels) in the channel dimension and can incorporate FiLM-style scaling for richer conditioning.
  • For volumetric input, skip connections propagate fine-structure through encoder-decoder levels, with each stage concatenating conditional features channel-wise (Jiang et al., 23 May 2024).

Normalized input volumes (e.g., $256\times256\times N_{\mathrm{slices}}$, scaled into $[-1,1]$) and flexible condition-fusion strategies ensure architectural adaptability across MRI, CT, and other domain-specific pipelines.
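
To make the conditioning scheme concrete, the following PyTorch sketch shows a Conv + GroupNorm + activation block and channel-wise fusion of a noisy slice with its conditioning slices. This is a minimal illustration with assumed names (`CondBlock`, `fuse_condition`), not the authors' implementation; for volumetric data the same pattern would use Conv3D layers.

```python
import torch
import torch.nn as nn

class CondBlock(nn.Module):
    """Conv2D + GroupNorm + activation, the basic 2D denoiser block described above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()  # Swish; ReLU is the other activation mentioned in the text

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

def fuse_condition(x_noisy, cond):
    """Concatenate the noisy target with the conditioning image(s) along the channel axis."""
    return torch.cat([x_noisy, cond], dim=1)

# Example: one noisy 256x256 slice plus two conditioning slices -> a 3-channel U-Net input.
x_noisy = torch.randn(1, 1, 256, 256)
cond = torch.randn(1, 2, 256, 256)       # e.g., two neighboring slices or contrasts
features = CondBlock(in_ch=3, out_ch=64)(fuse_condition(x_noisy, cond))
```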

3. Diffusion Process, Noise Schedule, and Loss

Fast-DDPM defines its noise schedule by subsampling a smooth “master” $\alpha^2(t)$ curve, itself defined as

$$\alpha^2(t) = \prod_{j=1}^{\lfloor 1000\,t \rfloor} \left(1 - \beta(j/1000)\right), \quad \text{with } \beta(u) = 0.0001 + (0.02-0.0001)\,u.$$

Evaluated at ten points $t_i$ (either $t_i = i/10$ for the uniform scheduler, or denser at high noise for the non-uniform one), the schedule yields

$$\alpha_i = \sqrt{\alpha^2(t_i)}, \qquad \sigma_i = \sqrt{1-\alpha^2(t_i)}.$$

The one-step forward kernel and marginal transitions are

$$q\left(x_i \mid x_{i-1}\right) = \mathcal{N}\!\left(\sqrt{1-\beta_i}\, x_{i-1},\ \beta_i \mathbf{I}\right), \qquad \beta_i = \alpha^2_{i-1} - \alpha^2_i,$$

$$q(x_i \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar\alpha_i}\, x_0,\ (1-\bar\alpha_i)\, \mathbf{I}\right), \qquad \bar\alpha_i = \prod_{j=1}^i \alpha^2_j,$$

as in canonical DDPMs, but operating solely on these 10 selected scales.
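
A small NumPy sketch of this schedule construction is given below. The master curve and the uniform grid follow the formulas above; the specific non-uniform grid (a square-root spacing, denser at high noise) is an illustrative assumption rather than the paper's exact choice.

```python
import numpy as np

def master_alpha_sq(t, n=1000, beta_min=1e-4, beta_max=0.02):
    """alpha^2(t): cumulative product of (1 - beta(j/n)) up to j = floor(n * t)."""
    betas = beta_min + (beta_max - beta_min) * np.arange(1, n + 1) / n
    alpha_bar = np.cumprod(1.0 - betas)
    k = max(int(np.floor(n * t)), 1)
    return alpha_bar[k - 1]

def fast_ddpm_schedule(num_steps=10, uniform=True):
    """Return (alpha_i, sigma_i) at the 10 timesteps shared by training and sampling."""
    if uniform:
        t = np.arange(1, num_steps + 1) / num_steps        # t_i = i / 10
    else:
        # Illustrative non-uniform grid: spacing shrinks toward t = 1 (high noise).
        t = np.sqrt(np.arange(1, num_steps + 1) / num_steps)
    alpha_sq = np.array([master_alpha_sq(ti) for ti in t])
    return np.sqrt(alpha_sq), np.sqrt(1.0 - alpha_sq)

alphas, sigmas = fast_ddpm_schedule()    # 10 matched noise levels
```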

Training minimizes the standard MSE on the predicted Gaussian noise:

$$L(\theta) = \mathbb{E}_{(x_0, c),\ i\in\{1,\dots,10\},\ \epsilon\sim\mathcal{N}(0,I)} \left\|\epsilon - \epsilon_\theta\!\left( \alpha_i x_0 + \sigma_i \epsilon,\ c,\ i \right)\right\|^2.$$

This loss focuses network capacity exclusively on the noise levels that are actually used at inference, maximizing the training impact of each step.
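
In code, a single training step under this loss might look as follows. This is a hedged sketch: `denoiser` stands in for the conditional U-Net with an assumed `(input, timestep_index)` signature, and `alphas`/`sigmas` are the 10 levels from the scheduler sketch above, converted to torch tensors.

```python
import torch
import torch.nn.functional as F

def fast_ddpm_training_step(denoiser, x0, cond, alphas, sigmas):
    """Sample one of the 10 shared noise levels, diffuse x0 to it, and regress the added noise."""
    b = x0.shape[0]
    i = torch.randint(0, len(alphas), (b,), device=x0.device)   # index into the 10 levels
    a = alphas.to(x0.device)[i].view(b, 1, 1, 1)                 # add one more dim for 3D volumes
    s = sigmas.to(x0.device)[i].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_i = a * x0 + s * eps                                       # alpha_i * x0 + sigma_i * eps
    eps_hat = denoiser(torch.cat([x_i, cond], dim=1), i)         # condition fused channel-wise
    return F.mse_loss(eps_hat, eps)
```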

4. Sampling Procedure and Step Alignment

Sampling runs the following 10-step iterative scheme:

```
x_10 ~ N(0, I)
for i = 10 down to 1:
    t_i = i / 10  # or non-uniform grid
    x_{i-1} = (α_{i-1}/α_i) * x_i
              + [σ_{i-1} - (α_{i-1}/α_i) * σ_i] * ε_θ(x_i, c, i)
return x_0
```
Because the network is trained and sampled on the identical collection $\{\alpha_i, \sigma_i\}_{i=1}^{10}$, no noise levels are “wasted” on untrained regimes and there is no mismatch, in contrast to naive DDPM subsampling or mismatched DDIM/PLMS samplers. This principle underpins the extreme acceleration: sample generation converts from a 1,000-step, tens-of-seconds (to hours for 3D) process into a 10-step, second-scale operation with fidelity parity or improvement (Jiang et al., 23 May 2024).
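
The same loop can be rendered in Python under the assumptions of the training sketch above (conditional denoiser signature, torch tensors for the 10 levels); the boundary values $\alpha_0 = 1$ and $\sigma_0 = 0$, used so the final step lands on a clean image, are the usual convention and an assumption here rather than something spelled out in the text.

```python
import torch

@torch.no_grad()
def fast_ddpm_sample(denoiser, cond, alphas, sigmas, shape):
    """Deterministic 10-step sampler matching the update rule above (all tensors on one device)."""
    a = torch.cat([torch.ones(1), alphas])        # a[0] = alpha_0 = 1
    s = torch.cat([torch.zeros(1), sigmas])       # s[0] = sigma_0 = 0
    x = torch.randn(shape)                        # x_10 ~ N(0, I)
    for i in range(len(alphas), 0, -1):           # i = 10, ..., 1
        idx = torch.full((shape[0],), i - 1, dtype=torch.long)
        eps_hat = denoiser(torch.cat([x, cond], dim=1), idx)
        x = (a[i - 1] / a[i]) * x + (s[i - 1] - (a[i - 1] / a[i]) * s[i]) * eps_hat
    return x                                      # x_0
```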

5. Specialized Adaptations for High-Dimensional Medical Data

In 3D/4D instantiations:

  • The U-Net blocks are reconfigured to Conv3D (or Conv4D) layers with $3^3$ (resp. $3^4$) kernels, stride-2 downsampling, and learned transpose-convolution upsampling (see the sketch after this list).
  • Skip connections propagate all spatial resolutions.
  • Conditional data—a stack of adjacent slices, modalities, or contrasts—is encoded as multi-channel input, concatenated at each architecture level.
  • Input volumes are normalized slice-wise, and the model is trained to reconstruct highly structured anatomical details.
  • For tasks such as volumetric super-resolution, image denoising, and translation, Fast-DDPM consistently outperforms baseline convolutional and GAN-based architectures on both perceptual (SSIM) and distortion (PSNR) metrics (Jiang et al., 23 May 2024).
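
The volumetric reconfiguration referenced in the list above can be sketched in PyTorch as follows (illustrative block names; a true Conv4D layer is not a stock PyTorch module and would require a custom implementation, so only the 3D case is shown):

```python
import torch
import torch.nn as nn

class DownBlock3D(nn.Module):
    """Conv3D (3x3x3) + GroupNorm + SiLU, then stride-2 downsampling; also returns the skip features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(8, out_ch)
        self.act = nn.SiLU()
        self.down = nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        h = self.act(self.norm(self.conv(x)))
        return self.down(h), h                    # (downsampled features, skip connection)

class UpBlock3D(nn.Module):
    """Learned transpose-conv upsampling, then channel-wise fusion with the encoder skip."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv3d(2 * out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        return self.conv(torch.cat([self.up(x), skip], dim=1))
```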

6. Empirical Efficiency and Benchmark Results

On multi-volume super-resolution, the reported benchmarks show:

  • DDPM training: 136 h → Fast-DDPM: 26 h ($0.19 \approx 0.2\times$).
  • DDPM sampling: 3.7 min/volume → Fast-DDPM: 2.3 s/volume ($\approx 0.01\times$).
  • CT denoising and MRI translation exhibit similar speedups ($\sim 5\times$ in training, $\sim 100\times$ in sampling).
  • Across all tasks, Fast-DDPM achieves superior PSNR/SSIM and outperforms both classic and SOTA GAN/CNN methods (Jiang et al., 23 May 2024).

| Method | Training Time | Sampling Time | PSNR/SSIM | Relative Speedup |
|---|---|---|---|---|
| Standard DDPM | 136 h | 3.7 min | SOTA | baseline |
| Fast-DDPM | 26 h | 2.3 s | Higher | ~0.2× train / ~0.01× sample |
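
The relative-speedup column follows directly from the reported times:

$$\frac{26\ \text{h}}{136\ \text{h}} \approx 0.19 \approx 0.2\times, \qquad \frac{2.3\ \text{s}}{3.7\ \text{min}} = \frac{2.3\ \text{s}}{222\ \text{s}} \approx 0.01\times.$$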

7. Broader Impacts, Limitations, and Confirmatory Studies

The Fast-DDPM approach has catalyzed further work in fast, domain-adapted diffusion architectures:

  • Lung-DDPM+ replaces the standard 1,000-step DDPM with domain-conditioned, high-order ODE solvers, achieving $8\times$ fewer FLOPs, $6.8\times$ lower memory, and $14\times$ faster sampling while preserving segmentation and visual metrics (Jiang et al., 12 Aug 2025).
  • Minutes to Seconds applies a similar paradigm to 2D inpainting, combining reduced-parameter networks, skip-step DDIM sampling, and a two-stage (coarse-resolve/fine-refine) process for $60\times$ acceleration without a significant drop in LPIPS or SSIM (Zhang et al., 8 Jul 2024).
  • These architectures validate the transferability of the Fast-DDPM principle—full-step schedule matching, ultra-low NFE, and network-targeted efficiency gains—to diverse generation and restoration domains.

A potential limitation is the focus on one specific set of 10 schedule points; while high performance is retained across tasks, extremely nonstationary noise characteristics or highly atypical conditioning may require reoptimization of the schedule. Nonetheless, Fast-DDPM sets a new standard for computationally efficient, high-fidelity generative modeling in high-dimensional spaces (Jiang et al., 23 May 2024).
