
Continuous-Time Consistency Models (sCM)

Updated 8 May 2026
  • Continuous-Time Consistency Models are diffusion-based generative frameworks that use a continuous ODE structure to unify diffusion, flow matching, and distillation for fast, few-step sampling.
  • They employ the TrigFlow parameterization with adaptive normalization and gradient rescaling to ensure stable training and a continuous trajectory that minimizes discretization artifacts.
  • sCMs achieve state-of-the-art performance in image, 3D, and speech tasks by reducing network evaluations and offering scalable, high-quality generative outputs.

Continuous-time consistency models (sCM) are a class of diffusion-based generative models that unify and extend denoising diffusion models, flow matching, and consistency distillation in a simulation-free, ODE-based framework. They are distinguished by fast, few-step sampling and by a continuous treatment of the diffusion trajectory, which avoids limitations of discrete-time parameterizations such as added hyperparameter complexity and discretization artifacts. Built on the TrigFlow parameterization, sCMs support stable, scalable training and inference in image, video, 3D, and other generative domains. They narrow the sample-quality gap to classical multi-step diffusion while requiring only one or two network evaluations, and have been scaled to models with billions of parameters (Lu et al., 2024, Eilermann et al., 1 Sep 2025, Peng et al., 4 Jul 2025, Lee et al., 2024, Zheng et al., 9 Oct 2025).

1. Unified Theoretical Framework

The core principle of sCM is the embedding of all major probabilistic generative model formulations (EDM, flow matching, and consistency models) into a continuous-time ODE structure. This is achieved via the TrigFlow parameterization: for $x_0 \sim p_\mathrm{data}$ and $\epsilon \sim \mathcal{N}(0, \sigma_d^2)$, the noisy state at time $t \in [0, \pi/2]$ is given by

$$x_t = \cos t \cdot x_0 + \sin t \cdot \epsilon.$$

The associated probability-flow ODE is

$$\frac{dx_t}{dt} = \sigma_d v_\theta(x_t / \sigma_d, t),$$

where $v_\theta$ is a neural network parameterizing the instantaneous velocity. The unified “denoiser” $f_\theta$ recovers the clean sample in a single closed-form step: $f_\theta(x_t, t) = \cos t \cdot x_t - \sin t \cdot \sigma_d v_\theta(x_t / \sigma_d, t)$, with the property $f_\theta(x, 0) = x$.
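As a quick numerical check of the TrigFlow identities above, the following minimal sketch builds a noisy state, differentiates the path in time, and applies the denoiser formula. It assumes an idealized case in which the scaled network output $\sigma_d v_\theta$ equals the exact path velocity $\cos t \cdot \epsilon - \sin t \cdot x_0$; the function names and the value of `sigma_d` are illustrative and not taken from any reference implementation.

```python
import math
import random

def trigflow_interpolate(x0, eps, t):
    """Noisy state x_t = cos(t) * x0 + sin(t) * eps, elementwise."""
    return [math.cos(t) * a + math.sin(t) * b for a, b in zip(x0, eps)]

def trigflow_velocity(x0, eps, t):
    """Time derivative of x_t along the path: cos(t) * eps - sin(t) * x0."""
    return [math.cos(t) * b - math.sin(t) * a for a, b in zip(x0, eps)]

def denoise(x_t, v, t):
    """f(x_t, t) = cos(t) * x_t - sin(t) * v, where v stands in for sigma_d * v_theta."""
    return [math.cos(t) * x - math.sin(t) * u for x, u in zip(x_t, v)]

rng = random.Random(0)
sigma_d = 0.5  # assumed data standard deviation (illustrative value)
x0 = [rng.gauss(0.0, sigma_d) for _ in range(4)]   # stand-in for a data sample
eps = [rng.gauss(0.0, sigma_d) for _ in range(4)]  # noise with std sigma_d
t = 0.9

x_t = trigflow_interpolate(x0, eps, t)
v = trigflow_velocity(x0, eps, t)
x0_hat = denoise(x_t, v, t)

# With the exact velocity, the denoiser recovers x0 up to rounding error.
max_err = max(abs(a - b) for a, b in zip(x0, x0_hat))
```

With the exact velocity the recovery holds for any $t$ because $\cos^2 t + \sin^2 t = 1$; a trained network only approximates this.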

Rather than integrating the ODE at inference, sCM training enforces a consistency condition: $f_\theta(x_t, t)$ predicts the same clean sample at every point along the backward ODE trajectory. The continuous-time consistency objective optimizes the tangent of the model output along PF-ODE paths using the exact chain-rule gradient, removing the need for discretized simulation (Lu et al., 2024).
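To make the chain-rule tangent concrete, the derivation below differentiates the denoiser defined earlier along a trajectory $x_t$. It is a sketch that follows only from that definition; the shorthand $F_t = v_\theta(x_t/\sigma_d, t)$ is introduced here for readability.

```latex
\begin{aligned}
\frac{d f_\theta(x_t, t)}{dt}
  &= \frac{d}{dt}\Big[\cos t \cdot x_t - \sin t \cdot \sigma_d F_t\Big] \\
  &= -\sin t \cdot x_t + \cos t \cdot \frac{dx_t}{dt}
     - \cos t \cdot \sigma_d F_t - \sin t \cdot \sigma_d \frac{dF_t}{dt} \\
  &= -\cos t \left(\sigma_d F_t - \frac{dx_t}{dt}\right)
     - \sin t \left(x_t + \sigma_d \frac{dF_t}{dt}\right),
\qquad
\frac{dF_t}{dt}
  = \frac{\partial v_\theta}{\partial x}\cdot\frac{1}{\sigma_d}\frac{dx_t}{dt}
  + \frac{\partial v_\theta}{\partial t}.
\end{aligned}
```

The first bracket vanishes when the model's own velocity drives the trajectory, and the consistency condition asks the full expression to be zero. The term $dF_t/dt$ is a single Jacobian-vector product of the network with tangent $\big(\tfrac{1}{\sigma_d}\tfrac{dx_t}{dt},\, 1\big)$.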

2. Instability Diagnosis and Theoretical Stabilization

Previous continuous-time CM approaches were constrained by severe optimizer instabilities traced to the time-derivative term appearing in the training gradient. Key sources included:

  • Blow-up of time embeddings when using log/arctan warping (as in EDM)
  • High-variance terms arising from large derivatives in high-frequency time embeddings
  • Accumulation of model or gradient noise due to improper normalization in group-norm layers and tangent scaling

sCM addresses these by implementing:

  • An identity time-warp schedule, without any log/EDM reparameterization
  • Low-frequency sinusoidal (positional) embeddings for the time input, with a small scale
  • Adaptive double normalization (“AdoubleGN”) to remove training instability from group-norm interaction with time
  • Explicit gradient normalization by rescaling the tangent in the loss
  • Adaptive, learned time weighting to minimize per-timestep loss variance, implemented via a scalar weight trained jointly with the model
  • A “tangent warmup” factor that multiplies destabilizing terms and is ramped up during early training (Lu et al., 2024)
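Two of the stabilization measures listed above, tangent rescaling and tangent warmup, can be sketched in a few lines. This is a hedged illustration only: the normalization form `g / (||g|| + c)`, the constant `c = 0.1`, and the linear ramp over `warmup_steps` are assumptions chosen for the example, and the function names are hypothetical.

```python
import math

def normalize_tangent(g, c=0.1):
    """Rescale a tangent vector to g / (||g|| + c).

    The form and the constant c are assumptions for illustration; the point
    is that the rescaled norm stays below 1 however large g becomes.
    """
    norm = math.sqrt(sum(v * v for v in g))
    return [v / (norm + c) for v in g]

def warmup_factor(step, warmup_steps=10_000):
    """Linear ramp from 0 to 1 over warmup_steps (an assumed schedule)."""
    return min(1.0, step / warmup_steps)

def vector_norm(g):
    """Euclidean norm, used below to inspect the rescaled tangents."""
    return math.sqrt(sum(v * v for v in g))

# A huge tangent and a tiny tangent end up on comparable, bounded scales.
large = normalize_tangent([3000.0, 4000.0])
small = normalize_tangent([0.003, 0.004])
large_norm = vector_norm(large)
small_norm = vector_norm(small)

# Early in training the destabilizing term is mostly switched off.
early, halfway, late = warmup_factor(0), warmup_factor(5_000), warmup_factor(20_000)
```

Bounding the tangent norm keeps a single extreme timestep from dominating the gradient, while the ramp lets the network settle before the full tangent term is applied.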

3. Model Parameterization and Training Objectives

The sCM family features closed-form model predictions using the TrigFlow basis, applicable to images (Lu et al., 2024), 3D point clouds (Eilermann et al., 1 Sep 2025), and speech (Nishigori et al., 16 Jul 2025). Training objectives are formulated as simulation-free mean-squared distances along the ODE trajectory, or as continuous-time limit gradients.

For image generation, all tangent terms in the loss are computed analytically for TrigFlow (Lu et al., 2024, Chen et al., 12 Mar 2025).

In 3D shape domains, an analytic flow-matching objective is blended with a time-dependent Chamfer distance term, without requiring Jacobian-vector products (JVPs) (Eilermann et al., 1 Sep 2025).
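For readers unfamiliar with the Chamfer term, here is a minimal sketch of a symmetric Chamfer distance between two point sets. It uses the squared-Euclidean, mean-reduced variant, which is one common convention and an assumption here; the time-dependent weighting used in the cited work is not reproduced.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance.

    For each point, find the nearest neighbor in the other set, average the
    squared distances per direction, and add the two directions together.
    """
    a_to_b = sum(min(squared_distance(p, q) for q in points_b)
                 for p in points_a) / len(points_a)
    b_to_a = sum(min(squared_distance(q, p) for p in points_a)
                 for q in points_b) / len(points_b)
    return a_to_b + b_to_a

# Two points, and the same two points lifted by one unit along z.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
lifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

same = chamfer_distance(base, base)       # identical sets
shifted = chamfer_distance(base, lifted)  # every nearest neighbor is 1 unit away
```

This brute-force version is quadratic in the number of points; practical point-cloud pipelines typically rely on batched or spatially indexed nearest-neighbor search.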

For high-parameter models, scalable JVP computation is enabled by extending FlashAttention-2 kernels to propagate tangents, retaining 80–90% of standard throughput (Zheng et al., 9 Oct 2025).
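To illustrate what "propagating tangents" means, the sketch below implements forward-mode differentiation with dual numbers and uses it to compute the total time derivative of a denoiser along its ODE in a single pass. Everything here is a toy: `toy_velocity` is a hypothetical stand-in for a network, $\sigma_d$ is set to 1, and none of this reflects how fused attention kernels are actually written.

```python
import math

class Dual:
    """A value paired with a tangent; arithmetic propagates both together."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(x):
        """Wrap plain numbers as constants with zero tangent."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.tangent - other.tangent)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule for the tangent component.
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__

def dsin(x):
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)

def dcos(x):
    return Dual(math.cos(x.value), -math.sin(x.value) * x.tangent)

def toy_velocity(x, t):
    """Hypothetical stand-in for a velocity network: v(x, t) = x * cos(t)."""
    return x * dcos(t)

def denoiser(x, t):
    """f(x, t) = cos(t) * x - sin(t) * v(x, t), with sigma_d = 1 for brevity."""
    return dcos(t) * x - dsin(t) * toy_velocity(x, t)

def tangent_along_ode(x, t):
    """Total derivative df/dt along dx/dt = v(x, t), computed as one JVP."""
    dx_dt = toy_velocity(Dual(x), Dual(t)).value
    # Seed the input tangents with (dx/dt, dt/dt = 1) and read off the output tangent.
    return denoiser(Dual(x, dx_dt), Dual(t, 1.0)).tangent

tangent = tangent_along_ode(1.5, 0.7)
```

The forward pass and the tangent are computed together, which is why JVP support has to be built into each operator, including attention, rather than added afterwards.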

4. Sampling Algorithms and Fast Few-Step Generation

sCM architectures enable sampling with one or two explicit ODE steps, without resorting to numerical solvers or pre-trained teacher guidance. Images are generated with a two-step DDIM-like sampler. The design supports arbitrarily large models, with a typical NFE (number of function evaluations) as low as 2 for high-quality image or point cloud generation (Lu et al., 2024, Eilermann et al., 1 Sep 2025).
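A two-step sampler of this kind can be sketched as follows: denoise pure noise at $t = \pi/2$, re-noise the estimate to an intermediate time using the TrigFlow interpolation, and denoise once more. This is a minimal illustration under stated assumptions: the intermediate time `t_mid = 1.1` is a placeholder value, the function names are hypothetical, and the denoiser used in the demo is a trivial stand-in rather than a trained network.

```python
import math
import random

def two_step_sample(denoiser, dim, sigma_d, t_mid=1.1, rng=None):
    """Two-step consistency sampling sketch.

    denoiser(x, t) maps a noisy state at time t to a clean-sample estimate.
    t_mid is an assumed intermediate time, not a value taken from the text.
    """
    rng = rng or random.Random()
    t_max = math.pi / 2
    # Step 1: start from pure noise at t = pi/2 and jump to a clean estimate.
    x = [rng.gauss(0.0, sigma_d) for _ in range(dim)]
    x0_hat = denoiser(x, t_max)
    # Step 2: re-noise the estimate to t_mid with fresh noise, then denoise again.
    fresh = [rng.gauss(0.0, sigma_d) for _ in range(dim)]
    x_mid = [math.cos(t_mid) * a + math.sin(t_mid) * e
             for a, e in zip(x0_hat, fresh)]
    return denoiser(x_mid, t_mid)

# Toy stand-in for a trained model: all data sits at one point, so the ideal
# consistency function returns that point for every input. The list `calls`
# records the time of each network evaluation.
calls = []

def counting_denoiser(x, t):
    calls.append(t)
    return [2.0, -1.0, 0.5]

sample = two_step_sample(counting_denoiser, dim=3, sigma_d=0.5,
                         rng=random.Random(0))
```

The sampler calls the denoiser exactly twice, so the NFE is 2 regardless of model size; dropping the second step gives the one-step variant.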

For 3D point clouds, one-step or two-step Euler/Heun inference matches or outperforms latent and diffusion baselines on ShapeNet, with no need for expensive latent autoencoders or teacher-student distillation (Eilermann et al., 1 Sep 2025).

5. Empirical Performance and Scalability

sCM establishes new Pareto frontiers in sample quality vs. speed. On ImageNet 512×512 with 1.5B-parameter models, sCM via distillation achieves FID of 1.88 with 2 steps (vs. teacher EDM2-XXL 1.73 at 63 steps), outperforming prior approaches in both speed-up and sample quality. Scaling ablations show monotonic FID improvement with increased model size, and the gap to multi-step diffusion models reduces to ≤10% in FID (Lu et al., 2024). In 3D generation, ConTiCoM-3D attains competitive Chamfer and EMD metrics at orders-of-magnitude faster sampling rates compared to diffusion counterparts (Eilermann et al., 1 Sep 2025).
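The ImageNet 512×512 figures quoted above can be checked with a line of arithmetic each. The snippet below uses only the numbers stated in the text and treats the ratio of step counts as a rough proxy for the reduction in sampling cost, which is an assumption.

```python
teacher_fid, teacher_steps = 1.73, 63  # EDM2-XXL teacher, as quoted in the text
student_fid, student_steps = 1.88, 2   # distilled sCM, as quoted in the text

# Relative FID gap of the two-step student with respect to the teacher.
relative_fid_gap = (student_fid - teacher_fid) / teacher_fid

# Ratio of sampling steps, used here as a proxy for sampling cost.
step_reduction = teacher_steps / student_steps

print(f"relative FID gap: {relative_fid_gap:.1%}")  # relative FID gap: 8.7%
print(f"step reduction: {step_reduction:.1f}x")     # step reduction: 31.5x
```

The 8.7% gap is consistent with the stated bound of at most 10%.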

For large-scale text-to-image/video models, score-regularized sCM (“rCM”) achieves GenEval=0.83 on Cosmos-Predict2 (14B) in 4 steps, and VBench=84.9 on Wan2.1 (14B) in 4 steps, both matching or exceeding state-of-the-art DMD2, while retaining superior sample diversity (Zheng et al., 9 Oct 2025).

6. Domain Extensions and Specialized Variants

The sCM principle is generic and has been adapted to multiple modalities, including image, 3D point cloud, speech, and text-to-image/video generation.

Truncation-based sCM variants (TCM) improve one-step/two-step metrics by explicitly focusing network capacity on high-noise, generation-like tasks through subinterval training and robust boundary-preserving parameterization (Lee et al., 2024).

7. Limitations and Open Challenges

While sCM narrows the gap to classical diffusion, several open issues remain:

  • Training Instability: Sensitivity to time embedding, tangent normalization, and JVP quality persists, particularly at extreme timesteps and in low-precision arithmetic (Lu et al., 2024, Zheng et al., 9 Oct 2025)
  • Mode Covering vs. Sharpness: The forward-divergence sCM objective is “mode-covering,” tending to blur fine details; rCM remedies this by integrating a “mode-seeking” score-distillation loss (Zheng et al., 9 Oct 2025)
  • Scaling: Efficient, parallel JVP computation is necessary for practical scaling to 10B+ parameter models and long video sequences (Zheng et al., 9 Oct 2025)
  • Conditionality and Diversity: Most sCM frameworks currently emphasize unconditional or class-conditional generation; extending to text/image/video or guidance-rich tasks is ongoing (Eilermann et al., 1 Sep 2025, Chen et al., 12 Mar 2025)
  • Refinement Trade-offs: Selective multi-step refinement and stratified sample-space design (as in TBCM) offer controlled fidelity-speed trade-offs but require further study (Tang et al., 25 Nov 2025)

Table: Representative sCM Applications and Results

| Domain | Architecture | Steps | Metric (best) | Reference |
| --- | --- | --- | --- | --- |
| ImageNet 512×512 | UNet/EDM2-XXL | 2 | FID = 1.88 | (Lu et al., 2024) |
| ShapeNet 3D | Point-Voxel UNet | 2 | CD = 48.90, EMD = 45.21 | (Eilermann et al., 1 Sep 2025) |
| MJHQ-30K (T2I) | Pretrained SANA | 1 | FID = 6.52, CLIP = 28.08 | (Tang et al., 25 Nov 2025) |
| Cosmos-Predict2 | 14B rCM | 4 | GenEval = 0.83 | (Zheng et al., 9 Oct 2025) |
| Speech Enhancement | SBCTM | 1 | PESQ = 3.56, RTF = 0.045 | (Nishigori et al., 16 Jul 2025) |
