PCMs: Deterministic Few-Step Diffusion Models

Updated 25 April 2026
  • Phased Consistency Models (PCMs) are a refinement of self-distilled diffusion models that partition the diffusion process into distinct temporal phases to improve consistency, controllability, and efficiency.
  • They employ phase-specific solvers trained to mimic teacher trajectories, effectively decoupling classifier-free guidance from the distillation objective and enabling deterministic few-step inference.
  • PCMs achieve state-of-the-art performance in text-to-image, text-to-video, and text-to-motion generation, as evidenced by superior FID scores and real-time throughput.

Phased Consistency Models (PCMs) are a refinement of self-distilled diffusion generative models that enable high-quality sampling in a few deterministic steps, overcoming key limitations of prior latent consistency techniques in both text-to-image and text-to-motion tasks. PCMs partition the diffusion process into multiple temporal phases and train interval-specific consistency functions. The resulting solvers compose deterministically and preserve sample quality, text alignment, and guidance flexibility even at very low step counts. Applications include state-of-the-art image, video, and human motion generation with high computational efficiency and controllability (Wang et al., 2024, Jiang et al., 31 Jan 2025).

1. Mathematical Definition and Core Objectives

A standard diffusion process is described in terms of a forward SDE in latent or pixel space:

$$dx_t = f_t x_t\, dt + g_t\, dw_t,\qquad x_t = \alpha_t x_0 + \sigma_t \epsilon,\quad \epsilon\sim\mathcal{N}(0, I)$$

The associated probability-flow ODE (PF-ODE) enables deterministic transformation from noise to data:

$$dx_t = \left[f_t x_t - \frac{1}{2}g_t^2 \nabla_x \log p_t(x)\right] dt$$
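As a concrete illustration of the closed-form marginal $x_t = \alpha_t x_0 + \sigma_t \epsilon$, the following is a minimal NumPy sketch. The cosine-style variance-preserving schedule in `alpha_sigma` and all function names are assumptions made for illustration; the text above does not fix a particular schedule.

```python
import numpy as np

def alpha_sigma(t):
    """Hypothetical variance-preserving schedule on t in [0, 1]:
    alpha_t = cos(pi t / 2), sigma_t = sin(pi t / 2), so alpha_t^2 + sigma_t^2 = 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def forward_noise(x0, t, rng):
    """Draw x_t = alpha_t * x0 + sigma_t * eps with eps ~ N(0, I)."""
    alpha_t, sigma_t = alpha_sigma(t)
    eps = rng.standard_normal(x0.shape)
    return alpha_t * x0 + sigma_t * eps, eps

rng = np.random.default_rng(0)
x0 = np.ones(4)                      # stand-in for a clean latent
x_t, eps = forward_noise(x0, 0.5, rng)
```

Because the marginal is available in closed form, training never needs to simulate the SDE step by step; a noisy state at any time can be drawn directly from a clean sample.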

Consistency Models (CMs) learn a single solver that maps any noisy state $x_t$ on a PF-ODE trajectory back to the clean data point $x_\epsilon$, producing the same output for all $t, t' \in [\epsilon, T]$ on that trajectory. In contrast, Phased Consistency Models (PCMs) partition $[\epsilon, T]$ into $M$ sub-intervals defined by edge times $s_0 = \epsilon < s_1 < \dots < s_M = T$. For each interval $[s_m, s_{m+1}]$, a separate function $f_\theta^m(x_t, t)$ is trained to recover the start state $x_{s_m}$ of that interval for all $t$ within the interval:

$$f_\theta^m : (x_t, t) \mapsto x_{s_m}, \qquad t \in [s_m, s_{m+1}]$$

Chaining these $M$ solvers acts as a deterministic composition $f_\theta^0 \circ f_\theta^1 \circ \dots \circ f_\theta^{M-1}$ (with $f_\theta^{M-1}$ applied first), mapping $x_T$ to $x_\epsilon$ in exactly $M$ neural network calls without stochasticity or noise reinjection (Wang et al., 2024, Jiang et al., 31 Jan 2025).
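To make the partition concrete, the sketch below builds edge times and looks up which phase a given time belongs to. Linear spacing of the edges, the value of the lower bound, and the helper names `phase_edges` and `phase_index` are assumptions for illustration only. An edge time shared by two phases is assigned to the lower phase, since a phase solver is evaluated at the upper edge of its own interval during sampling.

```python
import numpy as np

def phase_edges(eps, T, M):
    """Hypothetical choice: M + 1 linearly spaced edges s_0 = eps < ... < s_M = T."""
    return np.linspace(eps, T, M + 1)

def phase_index(t, edges):
    """Return m with s_m <= t <= s_{m+1}.

    An interior edge s_k is assigned to phase k - 1, whose solver
    maps a state at s_k back to the state at s_{k-1}."""
    m = np.searchsorted(edges, t, side="left") - 1
    return int(np.clip(m, 0, len(edges) - 2))

edges = phase_edges(0.002, 1.0, 4)   # 4 phases on [0.002, 1.0]
m = phase_index(0.3, edges)          # phase containing t = 0.3
target_time = edges[m]               # the solver for phase m predicts the state at s_m
```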

2. Theoretical Formulation and Training Objectives

PCMs leverage the exact ODE solution between two times $t$ and $s$ with $s < t$:

$$x_s = \frac{\alpha_s}{\alpha_t} x_t + \alpha_s \int_{\lambda_t}^{\lambda_s} e^{-\lambda}\, \sigma_{t_\lambda(\lambda)}\, \nabla \log p_{t_\lambda(\lambda)}\big(x_{t_\lambda(\lambda)}\big)\, d\lambda$$

where $\lambda_t = \ln(\alpha_t / \sigma_t)$ and $t_\lambda(\lambda)$ is the time corresponding to $\lambda$. Replacing the true score function with a network, $\nabla \log p_t(x_t) \approx -\hat\epsilon_\theta(x_t, t)/\sigma_t$, and approximating the integral by holding the network output fixed at $(x_t, t)$, the PCM one-shot solver becomes:

$$\hat{x}_s = \frac{\alpha_s}{\alpha_t} x_t - \alpha_s\, \hat\epsilon_\theta(x_t, t) \int_{\lambda_t}^{\lambda_s} e^{-\lambda}\, d\lambda$$

with $\int_{\lambda_t}^{\lambda_s} e^{-\lambda}\, d\lambda = \sigma_t/\alpha_t - \sigma_s/\alpha_s$. This is algebraically equivalent to the deterministic DDIM update under appropriate score function estimation (Theorem 4.1 in (Wang et al., 2024)).
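The stated equivalence with DDIM can be checked numerically. The sketch below assumes the usual exponential-integrator form of the one-shot solver, $\hat{x}_s = \frac{\alpha_s}{\alpha_t} x_t - \alpha_s \hat\epsilon \int_{\lambda_t}^{\lambda_s} e^{-\lambda} d\lambda$ with $\lambda_t = \ln(\alpha_t/\sigma_t)$, and a noise-prediction parameterization; `eps_hat` is a random stand-in for a network output, and the function names and schedule values are hypothetical.

```python
import numpy as np

def one_shot_solver(x_t, eps_hat, a_t, s_t, a_s, s_s):
    """x_s = (a_s / a_t) x_t - a_s * eps_hat * int_{lam_t}^{lam_s} e^{-lam} dlam."""
    lam_t, lam_s = np.log(a_t / s_t), np.log(a_s / s_s)
    integral = np.exp(-lam_t) - np.exp(-lam_s)   # closed form of the integral
    return (a_s / a_t) * x_t - a_s * eps_hat * integral

def ddim_step(x_t, eps_hat, a_t, s_t, a_s, s_s):
    """Deterministic DDIM: predict x_0, then re-noise to time s with the same eps_hat."""
    x0_hat = (x_t - s_t * eps_hat) / a_t
    return a_s * x0_hat + s_s * eps_hat

rng = np.random.default_rng(0)
x_t = rng.standard_normal(8)
eps_hat = rng.standard_normal(8)                 # stand-in for a network prediction
a_t, s_t = 0.6, 0.8                              # illustrative (alpha_t, sigma_t)
a_s, s_s = 0.8, 0.6                              # illustrative (alpha_s, sigma_s)

x_s_solver = one_shot_solver(x_t, eps_hat, a_t, s_t, a_s, s_s)
x_s_ddim = ddim_step(x_t, eps_hat, a_t, s_t, a_s, s_s)
```

Since $e^{-\lambda_t} = \sigma_t/\alpha_t$, the integral evaluates to $\sigma_t/\alpha_t - \sigma_s/\alpha_s$, and expanding the solver gives exactly the two DDIM terms.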

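For phase-wise learning, the Phased Consistency Distillation (PCD) loss samples pairs of adjacent states within a sub-interval and matches the student's output against a target computed along the teacher's trajectory:

$$\mathcal{L}_{\mathrm{PCD}}(\theta, \theta^-; \phi) = \mathbb{E}\left[\lambda(t_n)\, d\!\left(f_\theta^m(x_{t_{n+1}}, t_{n+1}),\; f_{\theta^-}^m(\hat{x}^{\phi}_{t_n}, t_n)\right)\right]$$

where $t_n < t_{n+1}$ lie in the $m$-th sub-interval, $\hat{x}^{\phi}_{t_n}$ is the teacher's ODE-solver estimate of the state at $t_n$ obtained from $x_{t_{n+1}}$, $\theta^-$ is an EMA target, $d(\cdot, \cdot)$ is a distance (L2 or Huber), and $\lambda(t_n)$ is a weighting. Low-step regimes additionally use an adversarial “consistency discriminator” penalty governed by a GAN-style hinge loss for further sample refinement (Wang et al., 2024, Jiang et al., 31 Jan 2025).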
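The structure of one distillation term can be sketched as follows. This is a toy illustration, not the authors' training code: `student`, `ema_target`, and `teacher_step` are hypothetical callables standing in for the student network, its EMA copy, and one teacher ODE-solver step, and the Huber distance is one of the two distance choices mentioned above. The adversarial term is omitted.

```python
import numpy as np

def huber(a, b, delta=1.0):
    """Mean element-wise Huber distance between two arrays."""
    r = np.abs(a - b)
    quad = 0.5 * r ** 2
    lin = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quad, lin)))

def pcd_loss(student, ema_target, teacher_step, x_next, t_next, t_cur, weight=1.0):
    """One PCD term for a pair of times t_cur < t_next in the same sub-interval."""
    x_cur_hat = teacher_step(x_next, t_next, t_cur)  # teacher moves the state to t_cur
    pred = student(x_next, t_next)                   # student prediction from the later time
    ref = ema_target(x_cur_hat, t_cur)               # EMA prediction, treated as a constant
    return weight * huber(pred, ref)

def ema_update(target_params, student_params, rate=0.95):
    """Exponential moving average of the student parameters."""
    return rate * target_params + (1.0 - rate) * student_params

# Toy usage: a perfectly self-consistent student incurs zero loss.
student = lambda x, t: 0.5 * x
ema_target = lambda x, t: 0.5 * x
teacher_step = lambda x, t_next, t_cur: x
x_next = np.array([1.0, -2.0])
loss = pcd_loss(student, ema_target, teacher_step, x_next, 0.8, 0.7)
```

In an actual training loop the loss would be differentiated with respect to the student parameters only, and `ema_update` would be applied after each optimizer step.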

3. Addressing Limitations of Latent Consistency Models

PCMs were developed to address three principal flaws in Latent Consistency Model (LCM) design (Wang et al., 2024):

  • Consistency: LCMs rely on stochastic multi-step samplers with new noise injected at each step, resulting in output instability and variability as the number of steps changes.
  • Controllability: Since classifier-free guidance (CFG) weights are baked into the one-phase distillation, LCMs tolerate only very small guidance scales, and show negligible response to negative prompts.
  • Efficiency: In the low-step regime, the loss is too coarse to support fine-grained generation, leading to a sharp degradation in sample quality.

By explicitly phasing the trajectory, PCMs restore determinism in multi-step inference, separate the guidance weights from the distillation objective (enabling full CFG/negative-prompt flexibility), and allow for task-specific solvers and consistency losses that preserve fidelity even at very low step counts (Wang et al., 2024, Jiang et al., 31 Jan 2025).

4. Training Algorithms and Guidance Integration

PCM training distills from a frozen “teacher” diffusion model (a pretrained model such as Stable Diffusion or a latent motion predictor) using multi-phase consistency objectives. Each training iteration samples data, noises it to a random time within a sub-interval, computes teacher and student solutions from that state, and penalizes their discrepancy. EMA stabilization and adversarial losses are used throughout (Wang et al., 2024, Jiang et al., 31 Jan 2025).

Conditioned CFG-guided teacher ODE solutions allow the decoupling of guidance scale from the PCM phase solvers, so phase-specific consistency models can be applied with arbitrary CFG weights at sampling time. For video generation, the method is extended to a spatiotemporal 3D U-Net by inflating the 2D model and sequentially distilling the image PCM to the video PCM, using identical phased frameworks, loss forms, and discriminator configurations (Wang et al., 2024).
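For reference, classifier-free guidance combines a conditional and an unconditional noise prediction at sampling time. The sketch below uses one common convention, $\hat\epsilon = \epsilon_{\text{uncond}} + w\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$; the function name is hypothetical, and the exact parameterization used in the cited papers may differ. In typical text-to-image pipelines, a negative prompt is handled by substituting its prediction for the unconditional branch.

```python
import numpy as np

def cfg_eps(eps_cond, eps_uncond, w):
    """Guided prediction: eps_uncond + w * (eps_cond - eps_uncond).

    w = 1 recovers the conditional prediction and w = 0 the unconditional one;
    w > 1 extrapolates away from the unconditional (or negative-prompt) branch."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])     # stand-in for the prompt-conditioned prediction
eps_u = np.array([0.0, 1.0])     # stand-in for the unconditional or negative-prompt prediction
guided = cfg_eps(eps_c, eps_u, w=2.0)
```

Because the weight `w` enters only at this combination step, a solver whose training objective does not bake in a fixed `w` can be run with any guidance scale chosen at inference.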

5. Practical Sampling, Efficiency, and Real-Time Implementation

The PCM inference procedure is deterministic and requires exactly $M$ network calls for $M$ phases:

  • Sample initial noise $x_{s_M} = x_T \sim \mathcal{N}(0, I)$
  • For $m = M-1$ down to $0$: $x_{s_m} = f_\theta^m(x_{s_{m+1}}, s_{m+1})$
  • Decode $x_{s_0}$ (typically $s_0 = \epsilon$) via VAE or generator head
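The steps above can be sketched as a short loop. The phase solvers here are toy callables (each simply halves its input) standing in for trained networks, and `pcm_sample` and the edge times are hypothetical names and values; the point is only the control flow: one call per phase and no noise after the initial draw.

```python
import numpy as np

def pcm_sample(phase_solvers, edges, shape, rng):
    """Deterministic M-call sampler: x_{s_M} -> x_{s_{M-1}} -> ... -> x_{s_0}."""
    x = rng.standard_normal(shape)               # initial noise; the only randomness
    calls = 0
    for m in range(len(phase_solvers) - 1, -1, -1):
        x = phase_solvers[m](x, edges[m + 1])    # jump from s_{m+1} to s_m in one call
        calls += 1
    return x, calls

M = 4
edges = np.linspace(0.002, 1.0, M + 1)           # illustrative edge times
toy_solvers = [lambda x, t: 0.5 * x for _ in range(M)]

sample_a, n_calls = pcm_sample(toy_solvers, edges, (3,), np.random.default_rng(0))
sample_b, _ = pcm_sample(toy_solvers, edges, (3,), np.random.default_rng(0))
```

With the same seed the two samples are identical, which is the determinism property described above. A decoding step (VAE or generator head) would follow in a real pipeline.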

Stochastic variants can be realized by interpolating the network prediction with fresh noise, which increases sample diversity (Wang et al., 2024). In MotionPCM, all operations occur in compressed latent space and no random noise is injected between phases; combined with chainable single-call solvers, this yields real-time throughput (Jiang et al., 31 Jan 2025).

Table: Inference Speed and Quality (HumanML3D, Motion Synthesis) (Jiang et al., 31 Jan 2025)

| Method | FID (1-step) | FID (4-step) | Inference Time (s) |
|---|---|---|---|
| PCM (MotionPCM) | 0.054 | 0.036 | 0.031 (1-step) |
| MotionLCM-V2 | 0.072 | 0.056 | 0.046 (4-step) |
| DDPM/DDIM | n/a | n/a | 0.2–0.6 (100-step) |

Comparable improvements are seen in image/video PCMs, with PCM at 1 step achieving an FID(SD) of 8.27 versus 53.43 for LCM, and PCM at 4 steps scoring an FID of 5.81 on COCO-30K, outperforming all tested baselines (Wang et al., 2024).

6. Applications, Generalization, and Empirical Results

PCMs have been instantiated in multiple domains:

  • Text-to-image (COCO-30K/CC3M datasets): PCM shows lower FID and higher CLIP-Score than LCM, CTM, InstaFlow, and SD-Turbo in the 1–16 step regime, with marked improvements in sample consistency and negative-prompt controllability.
  • Text-to-video (WebVid/UCF101): Inflated 3D PCM matches or exceeds AnimateLCM in CLIP-Score and temporal flow, and achieves higher consistency across steps.
  • Text-to-motion (HumanML3D): MotionPCM generates human motion in real time with a 38.9% improvement in FID over the prior best, robustly capturing multi-stage and subtle motion cues that prior CM/LCM approaches fail to reproduce.

Ablation studies demonstrate that splitting the trajectory into multiple phases suffices to restore CFG flexibility, that latent-space discriminators improve stability, and that adversarial consistency penalties are critical for eliminating low-step artifacts (Wang et al., 2024, Jiang et al., 31 Jan 2025).

7. Architectural and Implementation Highlights

PCM architectures use the pretrained teacher backbone both for consistency distillation and as a frozen latent-space discriminator head. In practice, the student shares the teacher's U-Net structure and is initialized from the diffusion teacher. PCM training typically employs AdamW, cosine learning rate decay, batch sizes around 128, and EMA rates of 0.95; phase boundaries $\{s_m\}$ are selected in the original diffusion schedule space, commonly with linear spacing (Jiang et al., 31 Jan 2025).
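As a small illustration of the cosine learning rate decay mentioned above, the sketch below implements the standard half-cosine schedule. The base learning rate, step counts, and function name are hypothetical values chosen for the example and are not taken from the cited papers.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Half-cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

base_lr = 1e-4                                   # hypothetical base learning rate
schedule = [cosine_lr(s, 100, base_lr) for s in (0, 50, 100)]
```

The schedule starts at the base rate, passes through half of it at the midpoint, and reaches the minimum at the final step.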

Numerical stability is enhanced by clipping in single-step scenarios, which prevents endpoint blow-up. Real-time optimizations include all-latent computation, no inter-phase noise, and minimal model calls. CFG weights are sampled per-sample from a fixed range during training (MotionPCM), but are not baked into the consistency objective (Jiang et al., 31 Jan 2025).


Phased Consistency Models enable deterministic, few-step diffusion-based generation with competitive or superior quality across tasks. Their phase-wise decomposition, compositional sampling, and flexibility in guidance and loss design directly address practical and theoretical deficiencies in earlier consistency frameworks (Wang et al., 2024, Jiang et al., 31 Jan 2025).
