Score-Regularized Continuous-Time Consistency Models (rCM)

Updated 8 May 2026

The paper introduces a dual-objective rCM framework that combines forward-KL for mode coverage and reverse-KL for mode seeking, balancing diversity and detail.
It employs custom JVP kernels and parallelism strategies like FSDP to enable efficient few-step sampling for models exceeding 10 billion parameters.
Experimental results demonstrate a 15x to 50x speedup with maintained sample fidelity, showing significant improvements in both T2I and T2V generation quality.

Score-Regularized Continuous-Time Consistency Models (rCM) are a framework for scalable, high-quality distillation of large-scale diffusion models, particularly for text-to-image (T2I) and text-to-video (T2V) generation. rCM combines forward-KL (mode-covering) and reverse-KL (mode-seeking) objectives into a practical distillation method that preserves sample diversity and high-fidelity detail while enabling few-step sampling for models exceeding 10 billion parameters (Zheng et al., 9 Oct 2025).

1. Mathematical Foundations

Continuous-Time Consistency Models (sCM)

Given a pre-trained diffusion "teacher" model defined by the probability-flow ODE

$\frac{dx_t}{dt} = f_{\rm teacher}(x_t, t)$

which maps data $x_{t=0} \sim p_{\rm data}$ to terminal noise at $t=T$, a student consistency model $f_\theta(x_t, t)$ is trained to predict the original $x_0$ directly from $x_t$ at arbitrary time $t$. Discrete-time consistency minimizes the discrepancy between predictions at adjacent timesteps, while the continuous-time limit yields the sCM objective:

$L_{\rm cons}(\theta) = \mathbb{E}_{x_0, t} \big\| f_\theta(x_t, t) - f_{\theta^-}(x_t, t) - w(t) \frac{d\,f_{\theta^-}(x_t, t)}{dt} \big\|_2^2$

with $w(t) = \cos t$ and $x_t = \cos t\,x_0 + \sin t\,\varepsilon$. The differential

$\frac{d\,f_{\theta^-}(x_t, t)}{dt} = \nabla_{x_t} f_{\theta^-}(x_t, t)\,\frac{dx_t}{dt} + \partial_t f_{\theta^-}(x_t, t)$

necessitates Jacobian-vector products (JVPs) for efficient computation.
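The total time derivative above is a directional derivative of the network along the tangent $(dx_t/dt, 1)$, which is what a JVP computes in one forward pass. The sketch below illustrates this with a minimal dual-number implementation in plain Python. The names `Dual`, `toy_student`, and `jvp` are hypothetical, the scalar "student" is a stand-in rather than the paper's network, and the tangent used is the derivative of the interpolation path for a fixed $(x_0, \varepsilon)$ rather than a teacher ODE velocity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode number: a value plus its tangent (directional derivative)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        o = lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, other):
        o = lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, other):
        return lift(other) - self

    def __mul__(self, other):
        o = lift(other)
        return Dual(self.val * o.val, self.tan * o.val + self.val * o.tan)

    __rmul__ = __mul__


def lift(x):
    """Promote a plain number to a Dual with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    x = lift(x)
    return Dual(math.sin(x.val), math.cos(x.val) * x.tan)


def cos(x):
    x = lift(x)
    return Dual(math.cos(x.val), -math.sin(x.val) * x.tan)


def tanh(x):
    x = lift(x)
    y = math.tanh(x.val)
    return Dual(y, (1.0 - y * y) * x.tan)


def toy_student(x, t):
    """Hypothetical scalar stand-in for f(x_t, t); not the paper's network."""
    return cos(t) * x - sin(t) * tanh(0.5 * x)


def jvp(fn, primals, tangents):
    """Evaluate fn once on dual inputs; return (value, directional derivative)."""
    out = fn(*(Dual(p, v) for p, v in zip(primals, tangents)))
    return out.val, out.tan


x0, eps, t = 0.7, -1.2, 0.4
x_t = math.cos(t) * x0 + math.sin(t) * eps      # x_t = cos(t) x0 + sin(t) eps
dx_dt = -math.sin(t) * x0 + math.cos(t) * eps   # time derivative of that path
# Tangent (dx_t/dt, 1) yields grad_x f * dx_t/dt + partial_t f in a single pass.
value, df_dt = jvp(toy_student, (x_t, t), (dx_dt, 1.0))
```

In a PyTorch stack, `torch.func.jvp` plays the role of the hypothetical `jvp` helper here, taking the function, a tuple of primals, and a tuple of tangents.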

Limitations of sCM

sCM objectives fundamentally minimize a forward KL divergence (teacher $\|$ student), which enforces mode coverage but can accumulate errors and smooth out reconstructed details. Specifically:

Error Accumulation: Small inaccuracies at earlier diffusion times are amplified by self-feedback, resulting in blurred fine structures and reduced temporal coherence in video synthesis.
Mode-Covering: The forward-KL focus penalizes missed data modes, broadening density at the cost of sharpness and high-frequency features.
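The contrast between mode-covering and mode-seeking behaviour can be checked numerically on a one-dimensional toy problem. The sketch below is an illustration only: the bimodal "teacher" density and the two candidate "student" fits are hypothetical choices, and the KL integrals are approximated by a midpoint Riemann sum.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def teacher_pdf(x):
    """Hypothetical bimodal teacher: equal mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)


def broad_pdf(x):
    """Candidate student covering both modes (moment-matched variance of 10)."""
    return normal_pdf(x, 0.0, math.sqrt(10.0))


def narrow_pdf(x):
    """Candidate student locked onto a single mode."""
    return normal_pdf(x, 3.0, 1.0)


def kl(p, q, lo=-15.0, hi=15.0, n=6000):
    """Approximate KL(p || q) with a midpoint Riemann sum on [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0:
            total += px * math.log(px / qx) * dx
    return total


# Forward KL: teacher in the first slot, so missing a teacher mode is costly.
forward = {"broad": kl(teacher_pdf, broad_pdf), "narrow": kl(teacher_pdf, narrow_pdf)}
# Reverse KL: student in the first slot, so mass between the modes is costly.
reverse = {"broad": kl(broad_pdf, teacher_pdf), "narrow": kl(narrow_pdf, teacher_pdf)}

best_forward = min(forward, key=forward.get)
best_reverse = min(reverse, key=reverse.get)
```

In this toy setting the forward KL ranks the broad fit best and the reverse KL ranks the single-mode fit best, which mirrors the diversity-versus-sharpness trade-off described above.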

2. Score Regularization and The rCM Loss

rCM augments sCM with a mode-seeking reverse-KL regularization, inspired by Distribution-Matching Distillation (DMD). Defining $p_\theta^t$ and $p_{\rm teacher}^t$ as the time-$t$ marginals for student and teacher, the reverse-KL divergence

$D_{\rm KL}\big(p_\theta^t \,\|\, p_{\rm teacher}^t\big) = \mathbb{E}_{x_t \sim p_\theta^t}\big[\log p_\theta^t(x_t) - \log p_{\rm teacher}^t(x_t)\big]$

is approximated via a “fake-score” network $f_\phi$ trained using flow matching, leading to the DMD loss $L_{\rm DMD}(\theta)$, in which the target is held fixed by a stop-gradient (“sg”).

The combined rCM objective is:

$L_{\rm rCM}(\theta) = L_{\rm cons}(\theta) + \lambda\,L_{\rm DMD}(\theta)$

where $L_{\rm cons}$ preserves diversity, while $L_{\rm DMD}$ restores sharp detail via mode-seeking correction.

3. Implementation Techniques

FlashAttention-2 JVP Kernel

A custom Triton-based kernel fuses the JVP operation into the block-wise, tiled FlashAttention-2 forward pass. The implementation:

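Accumulates both the standard attention output $O = \mathrm{softmax}(QK^\top/\sqrt{d})\,V$ and the JVP output $\dot{O}$, where $\dot{O}$ is the tangent of $O$ induced by the input tangents $(\dot{Q}, \dot{K}, \dot{V})$.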
Extends to both self- and cross-attention, preserving parallelism efficiency.
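The arithmetic behind fusing a JVP into a tiled attention pass can be sketched in plain Python for a single query row. This is not the Triton kernel described above, only an assumed reference formulation. It streams over key/value blocks with an online softmax and keeps three extra accumulators so that the tangent comes out of the same pass. All function and variable names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, K, V):
    """Plain softmax attention for one query row (used only for checking)."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * dot(q, k) for k in K]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(wj * v[c] for wj, v in zip(w, V)) / z for c in range(len(V[0]))]


def attention_with_jvp(q, K, V, dq, dK, dV, block=2):
    """Stream over key/value blocks, accumulating output o and tangent do together."""
    scale = 1.0 / math.sqrt(len(q))
    width = len(V[0])
    m, l, acc_ds = -math.inf, 0.0, 0.0   # running max, normaliser, sum of w * ds
    acc_o = [0.0] * width                # sum of w * v
    acc_t = [0.0] * width                # sum of w * (ds * v + dv)
    for start in range(0, len(K), block):
        rows = range(start, min(start + block, len(K)))
        s = [scale * dot(q, K[j]) for j in rows]
        ds = [scale * (dot(dq, K[j]) + dot(q, dK[j])) for j in rows]
        m_new = max(m, max(s))
        r = math.exp(m - m_new)          # rescale earlier blocks to the new max
        l, acc_ds = l * r, acc_ds * r
        acc_o = [a * r for a in acc_o]
        acc_t = [a * r for a in acc_t]
        for j, sj, dsj in zip(rows, s, ds):
            w = math.exp(sj - m_new)
            l += w
            acc_ds += w * dsj
            for c in range(width):
                acc_o[c] += w * V[j][c]
                acc_t[c] += w * (dsj * V[j][c] + dV[j][c])
        m = m_new
    o = [a / l for a in acc_o]
    mean_ds = acc_ds / l                 # softmax-weighted mean of score tangents
    do = [acc_t[c] / l - mean_ds * o[c] for c in range(width)]
    return o, do


q, dq = [0.3, -0.5], [1.0, 0.2]
K = [[0.1, 0.4], [-0.7, 0.2], [0.5, 0.9]]
dK = [[0.0, 1.0], [0.3, -0.2], [-0.5, 0.1]]
V = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.5]]
dV = [[0.2, 0.1], [0.0, -0.3], [0.4, 0.0]]
o, do = attention_with_jvp(q, K, V, dq, dK, dV)
```

The tangent uses the identity $\dot{o} = \sum_j p_j(\dot{s}_j v_j + \dot{v}_j) - (\sum_j p_j \dot{s}_j)\,o$, so no second pass over the keys is needed once the running sums are available.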

Parallelism Strategies

FSDP: Each layer exposes a “JVP-aware” interface, enabling tangent and forward computations without recomputing gradients.
Ulysses Context-Parallelism: Query, key, and value (QKV) tensors are sharded and communicated by all-to-all protocols, with tangent vectors following the same pattern.
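The all-to-all layout change used by Ulysses-style context parallelism can be simulated on a single process with nested lists. The sketch below is a toy under stated assumptions: the sequence length and head count divide evenly by the number of ranks, and the function names are hypothetical. A real implementation would use collective communication across devices. The point it illustrates is that the tangent tensor goes through exactly the same exchange as the primal tensor.

```python
def all_to_all_seq_to_head(shards):
    """shards[r][s][h]: rank r holds its sequence chunk for all heads.
    Returns out[r][s][h]: rank r holds the full sequence for its head group."""
    world = len(shards)
    per_rank = len(shards[0][0]) // world
    out = []
    for r in range(world):
        full_seq = []
        for src in range(world):                 # sequence chunks in rank order
            for token in shards[src]:
                full_seq.append(token[r * per_rank:(r + 1) * per_rank])
        out.append(full_seq)
    return out


def all_to_all_head_to_seq(shards):
    """Inverse exchange: back to sequence-sharded layout with all heads."""
    world = len(shards)
    chunk = len(shards[0]) // world
    out = []
    for r in range(world):
        local = []
        for s in range(r * chunk, (r + 1) * chunk):
            token = []
            for src in range(world):             # head groups in rank order
                token.extend(shards[src][s])
            local.append(token)
        out.append(local)
    return out


WORLD, SEQ, HEADS = 2, 4, 4
CHUNK = SEQ // WORLD
# Toy "query" entries encode their own (token, head) position as 10 * s + h.
q_shards = [[[10 * s + h for h in range(HEADS)]
             for s in range(r * CHUNK, (r + 1) * CHUNK)] for r in range(WORLD)]
# Toy tangent: a scaled copy, so its layout is easy to compare with the primal.
dq_shards = [[[0.1 * v for v in token] for token in rank] for rank in q_shards]

q_heads = all_to_all_seq_to_head(q_shards)      # primal exchange
dq_heads = all_to_all_seq_to_head(dq_shards)    # tangent follows the same pattern
q_back = all_to_all_head_to_seq(q_heads)
```

After the first exchange each rank can run attention over the full sequence for its own heads, on primals and tangents alike, and the inverse exchange restores the original sequence sharding.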

Training and Inference Protocol

A generator/critic step alternates:

Generator: Samples $(x_0, \varepsilon, t)$, computes $f_\theta(x_t, t)$ and its JVP, forms $L_{\rm cons}$, and, after bootstrap, forms $L_{\rm DMD}$ using backward simulation of the student.
Critic: Updates the fake-score network $f_\phi$ by matching student rollouts via a flow-matching loss.
Inference: Few-step sampling ($1$ to $4$ steps), iterating between predicting $\hat{x}_0 = f_\theta(x_t, t)$ and re-noising to the next timestep.
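A few-step sampling loop can be sketched as follows. This is a minimal illustration that assumes the common multistep consistency scheme: predict $x_0$ with the student, then re-noise to the next time using the interpolation $x_t = \cos t\,x_0 + \sin t\,\varepsilon$. The time schedule, the constant toy student, and all names are hypothetical placeholders rather than the paper's configuration.

```python
import math
import random


def few_step_sample(student, x_init, times, rng):
    """Alternate x0-prediction and re-noising over a decreasing time schedule."""
    x = list(x_init)
    for i, t in enumerate(times):
        x0_hat = student(x, t)                   # one network evaluation (1 NFE)
        if i + 1 == len(times):
            return x0_hat                        # last step: keep the clean estimate
        t_next = times[i + 1]
        x = [math.cos(t_next) * a + math.sin(t_next) * rng.gauss(0.0, 1.0)
             for a in x0_hat]                    # re-noise to the next time
    return x


MU = [1.0, -2.0]   # toy data distribution: a point mass at MU
calls = []         # records each student evaluation to count NFE


def toy_student(x, t):
    """Placeholder student; for point-mass data the ideal prediction is MU."""
    calls.append(t)
    return list(MU)


rng = random.Random(0)
x_T = [rng.gauss(0.0, 1.0) for _ in MU]          # pure noise at t = pi / 2
schedule = [math.pi / 2, 1.1, 0.6, 0.3]          # hypothetical 4-step schedule
sample = few_step_sample(toy_student, x_T, schedule, rng)
```

The number of entries in `schedule` equals the number of function evaluations, so a four-entry schedule corresponds to 4 NFE.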

The stack employs PyTorch 2.0 (torch.func.jvp), custom Triton kernels, FSDP, Ulysses context-parallelism, BF16 mixed precision, and A100/H100 GPUs.

4. Experimental Results

Experiments span the Cosmos-Predict2 and Wan2.1 model families in T2I and T2V, with up to 14B parameters and 5-second video outputs.

Model	Params	NFE	GenEval (T2I)	VBench (T2V)
Cosmos-Predict2 (teacher)	14 B	70	0.84	—
+ DMD2	14 B	4	0.80	84.6
+ rCM	14 B	4	0.83	84.9

Speedup: rCM enables $15\times$ to $50\times$ acceleration in number of function evaluations (NFE) relative to teacher models.
Diversity: Unlike DMD2, rCM maintains mode coverage and avoids mode collapse in T2V (e.g., varied object poses).
Qualitative Quality: rCM sharply renders fine details (e.g., text in T2I, frame-to-frame coherence in T2V).
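As a quick arithmetic check on the table above, the NFE ratio between the teacher row and the 4-step student rows can be computed directly. The helper name is hypothetical, and the ratio counts network evaluations only, so it need not match wall-clock speedup.

```python
def nfe_speedup(teacher_nfe, student_nfe):
    """Ratio of network evaluations per sample, teacher over student."""
    return teacher_nfe / student_nfe


# NFE values taken from the table: teacher 70, distilled students 4.
speedup_4_step = nfe_speedup(70, 4)
```

With these table values the ratio is 17.5, which falls inside the reported range.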

5. Theoretical and Practical Implications

rCM leverages the complementary strengths of forward-KL (mode-covering) and reverse-KL (mode-seeking) training signals. This dual-objective strategy counters errors and smoothing artifacts observed with sCM alone, while preserving the ability to generate high-diversity samples in few steps. JVP-based self-feedback remains a computational bottleneck and source of numerical instability, especially in low precision; proposed remedies include higher-order flow formulations and Hutchinson-trace approximations.

Implementation complexity increases due to custom kernel requirements and parallelism adaptation, yet rCM achieves state-of-the-art quality without requiring adversarial finetuning or extensive hyperparameter search. A residual performance gap is noted for single-step T2V sampling.

6. Extensions and Future Work

Potential avenues for improving rCM include:

Adaptive regularization scheduling (time-dependent regularization weight $\lambda(t)$).
Integration with multi-step consistency trajectory approaches (e.g., sCTM).
Incorporating adversarial objectives to further improve sample realism.
Addressing the efficiency and error limitations intrinsic to JVP computation.

A plausible implication is that rCM’s unification of consistency and score distillation principles may generalize to other generative model distillation regimes, especially where balancing sample sharpness with output diversity remains critical (Zheng et al., 9 Oct 2025).

References (1)

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency (2025)
