Score-Regularized Continuous-Time Consistency Model (rCM)
- The paper introduces a dual-divergence training objective that combines forward consistency with reverse score distillation to mitigate error accumulation and mode-covering bias.
- It employs computational strategies such as a FlashAttention-2-based JVP kernel and a dual forward mode design, enabling efficient training of billion-parameter models.
- Empirical results show that rCM achieves competitive sample quality and diversity using only 1–4 denoising steps, dramatically accelerating inference.
A Score-Regularized Continuous-Time Consistency Model (rCM) is a framework for accelerating large-scale continuous-time diffusion model distillation by integrating consistency training with score-based regularization. Unlike standard consistency models (sCM), which enforce trajectory-level self-consistency via a forward divergence objective, rCM supplements this with a score distillation term that imposes reverse-divergence regularization. The dual-divergence design enables rCM to achieve high-fidelity sample quality and diversity in only a few sampling steps, allowing efficient training and inference on text-to-image and video diffusion models with billions of parameters (Zheng et al., 9 Oct 2025).
1. Foundations and Motivation
The continuous-time consistency model (sCM) distills a student generator from a teacher diffusion model by enforcing instantaneous consistency along the teacher’s probability flow ordinary differential equation (ODE). This is accomplished by matching both the output and its time derivative at each time step. As models scale to hundreds of millions or billions of parameters and the data domains expand from images to high-dimensional videos, sCM exhibits limitations:
- Error Accumulation: Small trajectory errors are amplified by the forward divergence objective.
- Mode-Covering Bias: sCM’s loss encourages broad coverage of the teacher’s distribution, resulting in less sharp or "oversmoothed" outputs, especially in applications requiring fine-grained detail.
rCM is introduced to address these limitations by augmenting the sCM objective with a score distillation (“mode-seeking” reverse-divergence regularization) term. This leads to sharper outputs and improved precision without sacrificing sCM’s stability or sample diversity.
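For reference, the relations below restate the teacher ODE and the instantaneous consistency condition that sCM enforces, in generic notation; the velocity field $v_\phi$ and the parameterization are illustrative assumptions rather than the paper's exact convention.

```latex
% Teacher probability-flow ODE and the differential self-consistency condition
% enforced by sCM (generic notation; v_\phi and f_\theta are illustrative symbols).
\begin{aligned}
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\phi(x_t, t)
    &&\text{(teacher's probability flow ODE)}\\
  f_\theta(x_t, t) &\approx x_0
    &&\text{(consistency function: any point on the trajectory maps to the clean sample)}\\
  \frac{\mathrm{d}}{\mathrm{d}t} f_\theta(x_t, t)
    &= \partial_t f_\theta + \nabla_{x_t} f_\theta \cdot v_\phi(x_t, t) = 0
    &&\text{(instantaneous consistency along the ODE)}
\end{aligned}
```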
2. Model Formulation and Objective
For a teacher diffusion model parameterizing the probability flow ODE, the sCM student is trained with a loss that penalizes deviations from the teacher's output and its time derivative. Formally, with $f_\theta$ as the student network and $f_{\theta^-}$ its stop-gradient (frozen) copy, the basic loss takes the form

$$\mathcal{L}_{\mathrm{sCM}}(\theta) = \mathbb{E}_{x_t,\,t}\left[\, w(t)\, \Big\| f_\theta(x_t, t) - \mathrm{sg}\Big[ f_{\theta^-}(x_t, t) - \frac{\mathrm{d} f_{\theta^-}(x_t, t)}{\mathrm{d} t} \Big] \Big\|_2^2 \,\right],$$

where $w(t)$ is a weighting schedule, $\mathrm{sg}[\cdot]$ denotes stop-gradient, and the total time derivative $\frac{\mathrm{d} f_{\theta^-}(x_t, t)}{\mathrm{d} t}$ along the teacher's ODE is evaluated via a Jacobian-vector product (JVP).
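A minimal PyTorch sketch of obtaining the tangent with a single forward-mode JVP is shown below; `student` and `teacher_velocity` are assumed stand-ins for the consistency network and the teacher's probability-flow velocity field, not the paper's actual API.

```python
# Sketch: evaluating d f_{theta^-}(x_t, t)/dt with one Jacobian-vector product.
# Assumes `student(x, t)` predicts the clean sample and `teacher_velocity(x, t)`
# returns dx_t/dt from the frozen teacher; both are hypothetical callables.
import torch
from torch.func import jvp

def scm_tangent(student, teacher_velocity, x_t, t):
    v = teacher_velocity(x_t, t)                 # dx_t/dt along the teacher ODE
    f_out, df_dt = jvp(
        lambda x, s: student(x, s),
        (x_t, t),                                # primal inputs
        (v, torch.ones_like(t)),                 # tangents: (dx_t/dt, dt/dt = 1)
    )
    # Detach both outputs: they form the stop-gradient target in the sCM loss.
    return f_out.detach(), df_dt.detach()
```

The detached pair then serves as the stop-gradient target in the loss above.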
The score-regularization term supervises the student's outputs on its own self-generated samples using a reverse-divergence objective (a generalized variant of Distribution Matching Distillation, DMD/DMD2). Writing $x_\sigma$ for a re-noised student sample and $s_{\mathrm{real}}, s_{\mathrm{fake}}$ for the scores of the noised data and student-sample distributions, a DMD-style surrogate is

$$\mathcal{L}_{\mathrm{score}}(\theta) = \mathbb{E}_{z,\,\sigma}\Big[\, \mathrm{sg}\big[ s_{\mathrm{fake}}(x_\sigma, \sigma) - s_{\mathrm{real}}(x_\sigma, \sigma) \big]^{\top} x_\sigma \,\Big].$$

The total rCM objective is thus

$$\mathcal{L}_{\mathrm{rCM}}(\theta) = \mathcal{L}_{\mathrm{sCM}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{score}}(\theta),$$

with a single weighting $\lambda$ empirically sufficient for balancing quality and diversity. The "sg" operator denotes a stop-gradient that blocks unintended gradient paths, here through the score estimates, so this term updates the student only through its generated samples.
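As a rough illustration of how the two terms combine in one update, the sketch below reuses the `scm_tangent` helper from above and hypothetical `real_score`/`fake_score` networks for the DMD-style reverse term; the weighting, noise schedule, and time sampling are simplified assumptions, not the paper's recipe.

```python
# Dual-divergence rCM loss sketch: forward consistency + lambda * reverse score term.
# All network arguments are hypothetical callables; schedule weightings are omitted.
import torch
import torch.nn.functional as F

def rcm_loss(student, teacher_velocity, real_score, fake_score,
             x_t, t, z, t_max=1.0, lam=1.0):
    # Forward (consistency) term: match the stop-gradient target built from the
    # frozen student's output and its tangent along the teacher ODE.
    f_tgt, df_dt = scm_tangent(student, teacher_velocity, x_t, t)
    consistency = F.mse_loss(student(x_t, t), f_tgt - df_dt)

    # Reverse (score-distillation) term on the student's own samples.
    x_gen = student(z, torch.full_like(t, t_max))        # one-step generation from noise
    sigma = torch.rand_like(t)                           # random re-noise level
    x_sig = x_gen + sigma.view(-1, *([1] * (x_gen.dim() - 1))) * torch.randn_like(x_gen)
    grad = (fake_score(x_sig, sigma) - real_score(x_sig, sigma)).detach()
    # Surrogate whose gradient w.r.t. the student matches the DMD-style reverse-KL
    # gradient (the added noise does not depend on the student parameters).
    score_term = (grad * x_gen).mean()

    return consistency + lam * score_term
```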
3. Technical Innovations for Large-Scale Training
Efficient, scalable training of rCM on models with up to 14B parameters and on high-dimensional tasks such as long-horizon video synthesis requires:
- FlashAttention-2-based JVP Kernels: The custom Triton-based kernel incorporates the Jacobian-vector product directly into the FlashAttention-2 attention operator, enabling memory-efficient tangent computation and compatibility with data/model parallelism frameworks including Fully Sharded Data Parallel (FSDP) and context/sequence parallelism.
- Dual Forward Mode Design: All network modules (convolutions, normalization layers, attention blocks) are refactored to accept tangent inputs, ensuring consistent JVP computation throughout the model and eliminating the numerical fragility observed in lower-precision (BF16) computation; a minimal sketch of this dual-forward pattern appears below.
These infrastructure advances allow practical, stable rCM training at unprecedented scale on modern clusters.
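The dual-forward idea can be illustrated as follows: each module returns both its output and the corresponding tangent, so the JVP is propagated explicitly layer by layer. The sketch uses a small MLP; the actual system applies the same pattern to attention through the custom FlashAttention-2 JVP kernel, which this sketch does not reproduce.

```python
# Minimal dual-forward sketch: every layer maps (x, dx) -> (y, dy).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLinear(nn.Linear):
    """Linear layer that also propagates a tangent: y = Wx + b, dy = W dx."""
    def dual_forward(self, x, dx):
        y = super().forward(x)
        dy = F.linear(dx, self.weight)     # bias is constant, so it drops from the tangent
        return y, dy

class DualMLP(nn.Module):
    """Two-layer MLP whose forward pass carries (primal, tangent) pairs end to end."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = DualLinear(dim, hidden)
        self.fc2 = DualLinear(hidden, dim)

    def dual_forward(self, x, dx):
        h, dh = self.fc1.dual_forward(x, dx)
        s = torch.sigmoid(h)
        a = h * s                          # SiLU activation
        da = (s + h * s * (1 - s)) * dh    # elementwise chain rule: d SiLU/dh * dh
        return self.fc2.dual_forward(a, da)
```

Running both streams in a single pass keeps the tangent numerically aligned with the primal computation, which is the kind of consistency the dual forward design targets.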
4. Reverse Divergence via Score Distillation
A notable innovation of rCM is its integration of score-based (reverse-divergence) distillation. Unlike forward-divergence training, which propagates consistency errors along the denoising trajectory, the reverse-divergence introduces a gradient signal that directly sharpens the output by aligning it with the teacher’s score vector field.
- Mode-Seeking vs. Mode-Covering Dynamics: sCM’s forward-divergence “covers” all teacher modes, sometimes at the expense of sharp details. Score distillation injects a “mode-seeking” signal, promoting high-fidelity reconstructions and counteracting systematic error accumulation.
- Empirical Resolution of Fine-Detail Quality: The addition of the long-skip score regularizer is shown to substantially improve sample crispness, the legibility of text regions, and texture reproduction, as evidenced by standardized benchmarks.
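Written generically, this mode-seeking signal corresponds to the gradient of a reverse KL divergence between the student's sample distribution and the data distribution, estimated from two score networks on re-noised student samples (illustrative DMD-style notation, not necessarily the paper's exact formulation):

```latex
% Reverse-KL (mode-seeking) gradient used in DMD-style score distillation.
% x_\sigma is a re-noised student sample; s_fake and s_real are scores of the
% noised student and data distributions, respectively.
\nabla_\theta\, D_{\mathrm{KL}}\!\left(p_\theta \,\middle\|\, p_{\mathrm{data}}\right)
  = \mathbb{E}_{z,\sigma}\!\left[
    \big(s_{\mathrm{fake}}(x_\sigma,\sigma) - s_{\mathrm{real}}(x_\sigma,\sigma)\big)^{\!\top}
    \frac{\partial x_\sigma}{\partial \theta}
  \right]
```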
5. Empirical Performance and Benchmarks
The rCM framework is validated on large-scale text-to-image (Cosmos-Predict2, Wan2.1) and text-to-video (5s, multi-clip) diffusion models with parameter counts up to 14B.
- Sampling Efficiency: rCM maintains or exceeds the sample quality and diversity achieved by state-of-the-art distillation approaches (DMD2), while requiring only 1–4 denoising steps, a 15×–50× acceleration over the teacher diffusion models (a minimal sampling sketch follows this list).
- Quality Metrics: rCM approaches teacher performance on GenEval (text-conditional image) and VBench (video generation) benchmarks. In video synthesis, it demonstrates distinct advantages in diversity, avoiding the collapse to static objects observed in competing distillation paradigms.
- Simplicity in Protocol: No special multi-stage training, GAN fine-tuning, or exhaustive hyperparameter search is required; the method generalizes robustly across model architectures and domains.
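The few-step inference pattern itself is simple; a minimal consistency-style sampler is sketched below, where the time schedule and variance-exploding re-noising are assumptions for illustration rather than the paper's exact sampler.

```python
# Sketch of 1-4 step consistency sampling: predict the clean sample, then re-noise.
import torch

@torch.no_grad()
def few_step_sample(student, shape, times=(1.0, 0.5, 0.25, 0.1), device="cuda"):
    b = shape[0]
    x = torch.randn(shape, device=device) * times[0]               # start from pure noise
    x0 = student(x, torch.full((b,), times[0], device=device))     # first clean prediction
    for t in times[1:]:
        x = x0 + t * torch.randn_like(x0)                          # partially re-noise to level t
        x0 = student(x, torch.full((b,), t, device=device))        # refine the prediction
    return x0
```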
6. Mathematical and Algorithmic Details
Key formulas central to rCM include:
- sCM Tangent Calculation: $\frac{\mathrm{d} f_{\theta^-}(x_t, t)}{\mathrm{d} t} = \partial_t f_{\theta^-}(x_t, t) + \nabla_{x_t} f_{\theta^-}(x_t, t)\, \frac{\mathrm{d} x_t}{\mathrm{d} t}$, computed with a single Jacobian-vector product along the teacher's probability flow ODE.
- Overall Training Objective: $\mathcal{L}_{\mathrm{rCM}}(\theta) = \mathcal{L}_{\mathrm{sCM}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{score}}(\theta)$, combining the forward consistency loss with the reverse-divergence score regularizer.
These are implemented with infrastructure supporting high-precision tangent calculation and robust parallel execution.
7. Future Directions and Open Problems
The paper identifies several directions for extending rCM:
- Unified Divergence Frameworks: Combining consistency (forward) and score distillation (reverse) signals under a general divergence minimization perspective for further increases in model quality and robustness.
- Numerical Stability: Exploring higher-precision tangent computation and alternative derivative approximations to relax restrictions on hardware and further scale up model size.
- New Modalities: The framework is readily extendable to new generative modeling domains, including text-to-3D and multimodal synthesis, due to its computational and architectural generality.
- Architectural Innovations: Novel time embedding and network module designs informed by ODE solvers and the properties of score/consistency objectives.
Summary Table: Distillation Model Comparison
| Model | Steps | Sample Quality (GenEval, image) | Diversity (VBench, video) | Sampling Speedup |
|---|---|---|---|---|
| DMD2 | 4 | Comparable to rCM | Collapsed trajectories | Baseline |
| rCM | 4 | Comparable to DMD2 | Diversity preserved | 15×–50× over the teacher |
rCM's gains in fine-detail fidelity and diversity are evident at inference speeds comparable to or faster than established baselines.
Conclusion
The Score-Regularized Continuous-Time Consistency Model (rCM) integrates continuous-time self-consistency distillation with score-based regularization to overcome forward-divergence and error accumulation limitations, achieving competitive sample quality and diversity for large-scale diffusion models with dramatic sampling acceleration. Technical advances in infrastructure, especially the FlashAttention-2 JVP kernel and dual-mode network design, underpin the scaling of rCM for practical application. The dual-divergence framework positions rCM as an efficient and theoretically grounded paradigm for high-fidelity generative modeling across image and video domains (Zheng et al., 9 Oct 2025).