Applicability of continuous-time consistency models (sCM) to large-scale text-to-image and video diffusion

Determine whether continuous-time consistency models (sCM) can be applied in practice to large-scale text-to-image and text-to-video diffusion models, given the infrastructure challenges of Jacobian–vector product (JVP) computation and the limitations of standard evaluation benchmarks. If they can, identify the conditions under which they apply.

Background

Continuous-time consistency models (sCM) are presented as theoretically principled and empirically effective for accelerating academic-scale diffusion models: they avoid discretization errors and decouple training from any specific sampler. Scaling sCM to application-level models, however, raises two challenges. First, modern training infrastructure commonly relies on BF16 precision, FlashAttention, and context parallelism, all of which complicate Jacobian–vector product (JVP) computation. Second, widely used evaluation settings (e.g., FID on weakly conditioned ImageNet) do not adequately capture fine-grained, strongly conditioned tasks such as text rendering.
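To make the JVP requirement concrete, here is a minimal sketch in plain Python of what a Jacobian–vector product computes: the directional derivative of a function along a given tangent, obtained in one forward pass without building the full Jacobian. It assumes the quantity of interest is the time derivative of a model output along a trajectory. The function `toy_f`, the `Dual` class, and the `jvp` helper are hypothetical illustrations, not the paper's kernel or any library's API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its tangent (forward-mode derivative)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def _lift(x):
    """Promote a plain number to a Dual with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def dcos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.tan)


def dsin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)


def toy_f(x, t):
    # Hypothetical scalar stand-in for a network output f(x, t).
    return dcos(t) * x + dsin(t) * x * x


def jvp(f, primals, tangents):
    """Return (f(primals), directional derivative of f along tangents)."""
    duals = [Dual(p, v) for p, v in zip(primals, tangents)]
    out = f(*duals)
    return out.val, out.tan


# Tangent (dx/dt, 1) gives the total derivative of f along a trajectory x(t).
x, t, dx_dt = 0.5, 0.3, -1.2  # illustrative values only
value, total_dt = jvp(toy_f, (x, t), (dx_dt, 1.0))
```

In a large model the same tangent has to be propagated through every layer, attention included. That is why fused attention kernels and reduced precision, which are written and tuned for the ordinary forward and backward passes, make this step difficult at scale.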

The paper sets out to clarify whether sCM is applicable to large-scale text-to-image and text-to-video tasks under these constraints; this is the open question it states at the outset. It then contributes infrastructure (a FlashAttention-2 JVP kernel and compatibility with parallelism) and studies sCM's empirical behavior, identifying quality limitations and introducing a score-regularized variant, rCM.

References

Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian–vector product (JVP) computation and the limitations of standard evaluation benchmarks.

— Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency (2510.08431 - Zheng et al., 9 Oct 2025) in Abstract
