Abstract: Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high-quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.
The paper introduces Consistency Models, which bypass iterative diffusion sampling by learning a function that maps any point on the PF ODE trajectory directly back to the original data.
It details two training methods—Consistency Distillation and Consistency Training—that enforce self-consistency using parameterized functions and EMA-based target networks.
Practical applications include fast one-step generation and versatile image editing tasks, achieving state-of-the-art FID scores on datasets like CIFAR-10 and ImageNet.
This paper introduces Consistency Models (CMs) (Song et al., 2023), a new class of generative models designed to address the slow iterative sampling process inherent in diffusion models while retaining many of their benefits. The core idea is to learn a function that directly maps points from any time step on a Probability Flow (PF) ODE trajectory back to the trajectory's origin (the data sample).
Core Concept: Consistency Function
PF ODE: Diffusion models rely on a PF ODE (Eq. 2) that transforms data $x_0$ into noise $x_T$. The reverse process generates data from noise $x_T$ by solving the ODE backward in time.
Consistency Function: Defined as $f : (x_t, t) \mapsto x_\epsilon$, where $x_t$ is a point on the ODE trajectory at time $t$, and $x_\epsilon$ is the point near the trajectory's origin (the data point, taken at a small time $\epsilon > 0$).
Self-Consistency Property: The defining characteristic is that for any two points $(x_t, t)$ and $(x_{t'}, t')$ on the same ODE trajectory, the consistency function's output is identical: $f(x_t, t) = f(x_{t'}, t') = x_\epsilon$.
Consistency Model: A parameterized function $F_\theta(x, t)$ is trained to approximate the true consistency function $f$ by enforcing this self-consistency property.
Implementation: Parameterization
A crucial aspect is enforcing the boundary condition $F_\theta(x, \epsilon) = x$. The paper proposes and uses a practical parameterization with skip connections:

$$F_\theta(x, t) = c_{\text{skip}}(t)\, x + c_{\text{out}}(t)\, \text{NN}_\theta(x, t),$$

where $\text{NN}_\theta(x, t)$ is a free-form neural network (e.g., based on diffusion model architectures like U-Net), and $c_{\text{skip}}(t)$, $c_{\text{out}}(t)$ are differentiable functions satisfying:
$c_{\text{skip}}(\epsilon) = 1$
$c_{\text{out}}(\epsilon) = 0$
This structure guarantees the boundary condition by construction and allows leveraging existing diffusion model architectures. The paper uses modified versions of the scaling factors from EDM (Karras et al., 2022) so that these conditions hold at $\epsilon > 0$.
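As a concrete illustration, here is a minimal PyTorch sketch of this parameterization, using the modified EDM scalings reported in the paper (with $\sigma_{\text{data}} = 0.5$ and $\epsilon = 0.002$); `nn_theta` stands in for the U-Net backbone, and the tensor shapes assume image batches:

```python
import torch

SIGMA_DATA = 0.5  # data standard deviation assumed by the EDM scalings

def c_skip(t, eps=0.002):
    # equals 1 at t = eps, so F_theta(x, eps) = x holds by construction
    return SIGMA_DATA**2 / ((t - eps)**2 + SIGMA_DATA**2)

def c_out(t, eps=0.002):
    # equals 0 at t = eps, silencing the network output at the boundary
    return SIGMA_DATA * (t - eps) / torch.sqrt(SIGMA_DATA**2 + t**2)

def consistency_model(nn_theta, x, t):
    """F_theta(x, t) = c_skip(t) * x + c_out(t) * NN_theta(x, t)."""
    t = t.view(-1, 1, 1, 1)  # broadcast over (batch, channel, height, width)
    return c_skip(t) * x + c_out(t) * nn_theta(x, t.flatten())
```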
Implementation: Sampling
One-Step Generation: Sample noise $x_T \sim \mathcal{N}(0, T^2 I)$ and compute the data sample directly: $\hat{x}_\epsilon = F_\theta(x_T, T)$. This is very fast, requiring only a single network evaluation.
Multi-Step Sampling (Algorithm 1): Improves sample quality by trading extra compute. It alternates denoising with the CM and re-injecting noise (see the sketch after this list):
At each step, select a time $\tau_n$ from a predefined decreasing sequence $T > \tau_1 > \ldots > \tau_{N-1} > \epsilon$, perturb the current sample to that noise level via $\hat{x}_{\tau_n} = x + \sqrt{\tau_n^2 - \epsilon^2}\, z$ with $z \sim \mathcal{N}(0, I)$, then denoise with $x \leftarrow F_\theta(\hat{x}_{\tau_n}, \tau_n)$.
The sequence $\{\tau_n\}$ can be found using optimization methods like greedy ternary search to minimize FID.
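A minimal sketch of Algorithm 1, assuming `model` is the parameterized $F_\theta$ above and `taus` is the chosen decreasing sequence (an empty `taus` reduces to one-step generation):

```python
import torch

@torch.no_grad()
def multistep_sample(model, taus, shape, T=80.0, eps=0.002, device="cpu"):
    """Alternate denoising with the CM and re-noising to the next time point.

    taus: decreasing time points T > tau_1 > ... > tau_{N-1} > eps.
    """
    x = T * torch.randn(shape, device=device)                 # x_T ~ N(0, T^2 I)
    x = model(x, torch.full((shape[0],), T, device=device))   # one-step sample
    for tau in taus:
        z = torch.randn_like(x)
        x_tau = x + (tau**2 - eps**2) ** 0.5 * z              # re-noise to level tau
        x = model(x_tau, torch.full((shape[0],), tau, device=device))
    return x
```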
Training Method 1: Consistency Distillation (CD)
This method trains a CM $F_\theta$ by distilling knowledge from a pre-trained diffusion (score) model $s_\phi$.
Goal: Enforce $F_\theta(x_{t_{n+1}}, t_{n+1}) \approx F_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)$ for adjacent points on the empirical PF ODE trajectory defined by $s_\phi$.
Process (Algorithm 2; see the sketch after this list):
Sample data $x_0$.
Sample a time index $n \sim \mathcal{U}\{1, \ldots, N-1\}$.
Generate a noisy sample $x_{t_{n+1}} \sim \mathcal{N}(x_0, t_{n+1}^2 I)$.
Use one step of a numerical ODE solver (e.g., Heun) with the score model $s_\phi$ to estimate the previous point: $\hat{x}^\phi_{t_n} = x_{t_{n+1}} + (t_n - t_{n+1})\, \Phi(x_{t_{n+1}}, t_{n+1}; s_\phi)$.
Minimize the consistency distillation loss (Eq. 7):

$$\mathcal{L}_{CD}^N(\theta, \theta^-; \phi) = \mathbb{E}\big[\lambda(t_n)\, d\big(F_\theta(x_{t_{n+1}}, t_{n+1}),\, F_{\theta^-}(\hat{x}^\phi_{t_n}, t_n)\big)\big].$$
$F_{\theta^-}$ is a target network, updated via an exponential moving average (EMA) of $\theta$: $\theta^- \leftarrow \text{stopgrad}(\mu \theta^- + (1 - \mu)\theta)$ (Eq. 8). Applying stop_gradient to the target network output is crucial for training stability.
$d(\cdot, \cdot)$ is a distance metric. LPIPS (Zhang et al., 2018) works best for images, outperforming L1 and L2.
$\lambda(t_n)$ is a weighting function (often set to 1).
Higher-order ODE solvers (like Heun's) generally perform better than lower-order ones (like Euler's) for computing $\hat{x}^\phi_{t_n}$.
The number of discretization intervals $N$ needs tuning (e.g., $N = 18$ for CIFAR-10 with the Heun solver).
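Below is a hedged sketch of one CD iteration. For brevity it uses a single Euler step for $\Phi$ rather than the Heun solver the paper prefers; `score_phi` is the pre-trained score model, `d` is a distance such as LPIPS, and `target_model` starts as a deep copy of `model`:

```python
import torch

def karras_boundaries(N, eps=0.002, T=80.0, rho=7.0):
    # EDM time discretization: t_i spaced on a rho-warped grid between eps and T
    i = torch.arange(N, dtype=torch.float64)
    return (eps ** (1 / rho) + i / (N - 1) * (T ** (1 / rho) - eps ** (1 / rho))) ** rho

def cd_step(model, target_model, score_phi, d, x0, N=18, mu=0.95):
    """One consistency-distillation iteration (Euler variant of Algorithm 2)."""
    t = karras_boundaries(N).to(x0)
    n = torch.randint(0, N - 1, (x0.shape[0],))          # n in {0, ..., N-2}
    tn = t[n].view(-1, 1, 1, 1)
    tn1 = t[n + 1].view(-1, 1, 1, 1)

    x_tn1 = x0 + tn1 * torch.randn_like(x0)              # x_{t_{n+1}} ~ N(x0, t_{n+1}^2 I)
    with torch.no_grad():
        # one Euler step of the empirical PF ODE dx/dt = -t * s_phi(x, t),
        # integrated backward from t_{n+1} to t_n
        x_hat_tn = x_tn1 + (tn1 - tn) * tn1 * score_phi(x_tn1, tn1.flatten())
        target = target_model(x_hat_tn, tn.flatten())    # acts as stop_gradient

    loss = d(model(x_tn1, tn1.flatten()), target).mean()
    loss.backward()
    # ... optimizer step on model's parameters, then the EMA update (Eq. 8):
    with torch.no_grad():
        for p, p_ema in zip(model.parameters(), target_model.parameters()):
            p_ema.mul_(mu).add_(p, alpha=1 - mu)
    return loss.item()
```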
Training Method 2: Consistency Training (CT)
This method trains a CM $F_\theta$ from scratch, without requiring a pre-trained diffusion model, making CMs an independent class of generative models.
Goal: Enforce $F_\theta(x_0 + t_{n+1} z, t_{n+1}) \approx F_{\theta^-}(x_0 + t_n z, t_n)$, where $z \sim \mathcal{N}(0, I)$ is shared between the two terms.
Process (Algorithm 3): Based on a theoretical result (Theorem 2) showing that, when the Euler solver is used implicitly with the ground-truth score (which admits an unbiased single-sample estimate given $x_0$), the CD loss matches the CT loss (Eq. 9) up to a term vanishing with the step size, so no teacher model is needed.
Sample data $x_0$.
Sample a time index $n \sim \mathcal{U}\{1, \ldots, N(k) - 1\}$, where $N(k)$ increases during training.
Crucially, CT uses adaptive schedules for the number of discretization steps $N(k)$ and the EMA decay rate $\mu(k)$, where $k$ is the training step: $N(k)$ typically starts small and increases, while $\mu(k)$ starts high (e.g., 0.9) and approaches 1. This balances convergence speed against final quality. Appendix C provides the specific schedule formulas, which the sketch below follows.
LPIPS is also effective here.
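A corresponding sketch of one CT iteration, reusing `karras_boundaries` from the CD sketch above. The schedule functions follow the form given in Appendix C; the hyperparameters $s_0$, $s_1$, $\mu_0$, and total iterations $K$ vary per dataset, and the values here are illustrative:

```python
import math
import torch

def N_schedule(k, K, s0=2, s1=150):
    # number of discretization steps grows from roughly s0 to s1 over training
    return math.ceil(math.sqrt(k / K * ((s1 + 1) ** 2 - s0**2) + s0**2) - 1) + 1

def mu_schedule(k, K, s0=2, mu0=0.9):
    # EMA decay rate starts at mu0 and rises toward 1 as N(k) grows
    return math.exp(s0 * math.log(mu0) / N_schedule(k, K))

def ct_step(model, target_model, d, x0, k, K):
    """One consistency-training iteration (Algorithm 3 sketch): no score model."""
    N = N_schedule(k, K)
    t = karras_boundaries(N).to(x0)
    n = torch.randint(0, N - 1, (x0.shape[0],))
    tn = t[n].view(-1, 1, 1, 1)
    tn1 = t[n + 1].view(-1, 1, 1, 1)

    z = torch.randn_like(x0)                      # one Gaussian sample, shared
    with torch.no_grad():
        target = target_model(x0 + tn * z, tn.flatten())
    loss = d(model(x0 + tn1 * z, tn1.flatten()), target).mean()
    loss.backward()
    # ... optimizer step, then EMA update of target_model using mu_schedule(k, K)
    return loss.item()
```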
Practical Applications & Results
Fast Generation: CMs achieve state-of-the-art FID scores for one-step and two-step generation on CIFAR-10 (3.55/2.93 FID) and ImageNet 64x64 (6.20/4.70 FID) when trained via CD, significantly outperforming Progressive Distillation (PD).
Standalone Performance: When trained via CT, CMs outperform other one-step non-adversarial methods (VAEs, Flows) and achieve results comparable to PD without needing distillation.
Zero-Shot Data Editing: CMs inherit the editing capabilities of diffusion models. Using variations of the multi-step sampling algorithm (Algorithm 4 in the appendix), they can perform the following (a sketch of the inpainting and interpolation variants appears after this list):
Inpainting: Mask unknown regions and iteratively refine using the CM.
Colorization: Treat color channels as missing information in a transformed space (e.g., YUV or using an orthogonal basis).
Super-Resolution: Treat high-frequency details as missing information in a transformed space (e.g., using patch averaging and orthogonal basis).
Stroke-guided Editing (SDEdit): Use a stroke image as the starting point $x_{\tau_1}$ in multi-step sampling.
Denoising: Apply $F_\theta(x_\sigma, \sigma)$ directly to an image $x_\sigma$ with noise level $\sigma$.
Interpolation: Interpolate between the initial noise vectors $z_1, z_2$ (e.g., using spherical linear interpolation) and then apply $F_\theta(\cdot, T)$.
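As an illustration, here is a hedged sketch of the inpainting variant (following the structure of the multi-step sampler; `mask` is 1 on known pixels and `y` is the reference image), together with spherical linear interpolation for the interpolation application:

```python
import torch

@torch.no_grad()
def inpaint(model, y, mask, taus, T=80.0, eps=0.002):
    """Zero-shot inpainting sketch: after each denoising step, overwrite the
    known region with the reference image y, then re-noise and repeat."""
    x = T * torch.randn_like(y)
    x = model(x, torch.full((y.shape[0],), T, device=y.device))
    x = mask * y + (1 - mask) * x
    for tau in taus:
        x_tau = x + (tau**2 - eps**2) ** 0.5 * torch.randn_like(x)
        x = model(x_tau, torch.full((y.shape[0],), tau, device=y.device))
        x = mask * y + (1 - mask) * x               # re-impose the known pixels
    return x

def slerp(z1, z2, alpha):
    """Spherical linear interpolation between two noise vectors z1, z2."""
    theta = torch.arccos(torch.sum(z1 * z2) / (z1.norm() * z2.norm()))
    return (torch.sin((1 - alpha) * theta) * z1
            + torch.sin(alpha * theta) * z2) / torch.sin(theta)
```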
Implementation Considerations
Architecture: Can reuse U-Net architectures from diffusion models (e.g., NCSN++, ADM).
Target Network: Using an EMA target network with stop_gradient is vital for both CD and CT.
Metric: LPIPS is highly recommended for image data.
Schedules (CT): Carefully designed adaptive schedules for $N(k)$ and $\mu(k)$ are important for CT performance.
Computational Cost: Training cost is comparable to training diffusion models. Inference is much faster (1 network evaluation for one-step generation, $N$ evaluations for $N$-step sampling).
Continuous-Time Extensions
The paper also derives continuous-time versions of the CD and CT losses (Appendix B), eliminating the need for discrete time steps $t_n$. These objectives require calculating Jacobian-vector products, often necessitating forward-mode automatic differentiation, which might not be standard in all frameworks. Experimental results show they can work well, especially continuous-time CT, but may require careful initialization or variance-reduction techniques.
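For example, in PyTorch the required Jacobian-vector product can be obtained with forward-mode AD via `torch.func.jvp`; the snippet below is a minimal illustration, not the paper's implementation:

```python
import torch
from torch.func import jvp

def consistency_jvp(model, x, t, v, dt):
    """Forward-mode JVP of F_theta at (x, t) along the direction (v, dt).

    Returns F_theta(x, t) and (dF/dx) v + (dF/dt) dt in a single forward pass,
    the quantity the continuous-time objectives take expectations over.
    """
    return jvp(model, (x, t), (v, dt))
```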