Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency (2510.08431v1)

Published 9 Oct 2025 in cs.CV and cs.LG

Abstract: This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

Summary

The paper introduces rCM, a hybrid method that integrates forward continuous-time consistency with reverse score distillation to balance quality and diversity.
The distilled models sample in 1–4 steps, up to 50× faster than their teachers; training at this scale relies on a custom FlashAttention-2 JVP kernel and targeted precision control.
Empirical results show state-of-the-art performance on T2I and T2V benchmarks such as GenEval and VBench with minimal sampling steps.

Introduction and Motivation

This paper addresses the challenge of accelerating large-scale diffusion models for image and video generation by distilling them into efficient few-step generators. While continuous-time consistency models (sCM) offer a theoretically principled approach for distillation, their practical application to high-capacity models and complex tasks (e.g., text-to-image/video) is hindered by infrastructure limitations and quality degradation, especially in fine-detail and temporal consistency. The authors introduce a score-regularized continuous-time consistency model (rCM), which integrates score distillation as a long-skip regularizer to complement sCM, thereby improving sample quality and maintaining diversity.

Background: Diffusion and Consistency Models

Diffusion models (DMs) learn to reverse a noise-perturbation process, typically parameterized as a velocity predictor, and are trained via MSE on denoising tasks. Consistency models (CMs) shortcut the teacher ODE trajectory, directly predicting the initial data from noisy inputs. Discrete-time CMs suffer from discretization errors and require annealing schedules, while sCM eliminates these issues by operating in continuous time, leveraging Jacobian-vector product (JVP) computation for tangent estimation.
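The "shortcut" described above can be illustrated with the TrigFlow parameterization that sCM adopts (noise schedule $\alpha_t=\cos(t),\sigma_t=\sin(t)$, with $c_\text{skip}(t)=\cos(t)$ and $c_\text{out}(t)=-\sin(t)$). The sketch below is a toy under stated assumptions: unit data scale, and a per-sample oracle velocity standing in for a trained network. The names `perturb`, `oracle_velocity`, and `consistency_fn` are illustrative and do not come from the paper's code.

```python
import math
import random

def perturb(x0, z, t):
    """TrigFlow forward process: x_t = cos(t) * x0 + sin(t) * z."""
    return [math.cos(t) * a + math.sin(t) * b for a, b in zip(x0, z)]

def oracle_velocity(x0, z, t):
    """Per-sample velocity dx_t/dt = cos(t) * z - sin(t) * x0 (stand-in for a network)."""
    return [math.cos(t) * b - math.sin(t) * a for a, b in zip(x0, z)]

def consistency_fn(x_t, v, t):
    """f(x_t, t) = c_skip(t) * x_t + c_out(t) * v, with c_skip = cos(t), c_out = -sin(t)."""
    return [math.cos(t) * x - math.sin(t) * u for x, u in zip(x_t, v)]

random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(4)]   # toy "clean data"
z = [random.gauss(0, 1) for _ in range(4)]    # Gaussian noise

for t in (0.0, 0.5, 1.2, math.pi / 2):
    x_t = perturb(x0, z, t)
    x0_hat = consistency_fn(x_t, oracle_velocity(x0, z, t), t)
    max_err = max(abs(a - b) for a, b in zip(x0, x0_hat))
    print(f"t={t:.3f}  max |x0_hat - x0| = {max_err:.2e}")
```

With the oracle velocity, every noise level maps back to the same clean sample up to floating-point rounding, which is the self-consistency property a consistency model is trained to approximate.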

Score distillation methods, such as DMD2, match the student and teacher distributions via reverse divergence objectives, often requiring auxiliary score networks and adversarial training. These methods are state-of-the-art for large-scale distillation but tend to reduce sample diversity.
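For reference, the reverse-divergence objectives mentioned above can be written with the likelihood-ratio notation the paper uses. The block restates the general form; the outer expectation over $t$ and any time weighting are omitted here for brevity, so it should be read as a simplified sketch rather than the exact training loss.

```latex
r(x_t) = \frac{p_{\text{teacher}}^t(x_t)}{p_\theta^t(x_t)}, \qquad
\mathcal{D}_f\big(p_\theta^t \,\|\, p_{\text{teacher}}^t\big)
  = \mathbb{E}_{x_t \sim p_\theta^t}\big[ f(r(x_t)) \big]

% reverse KL (VSD / DMD):
f(r) = -\log r \;\Rightarrow\;
\mathcal{D}_f = \mathbb{E}_{x_t \sim p_\theta^t}\!\left[ \log \frac{p_\theta^t(x_t)}{p_{\text{teacher}}^t(x_t)} \right]
  = \mathrm{KL}\big(p_\theta^t \,\|\, p_{\text{teacher}}^t\big)

% Fisher divergence (SiD):
f(r) = \big\| \nabla_{x_t} \log r \big\|_2^2
```

Because the expectation is taken over student samples, the student score $\nabla_{x_t}\log p_\theta^t(x_t)$ appears in the gradient, which is why an auxiliary score network is needed.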

Scaling sCM: Infrastructure and Limitations

The authors develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models exceeding 10B parameters and high-dimensional video data. They restructure network layers for compatibility with FSDP and context parallelism, and implement JVP computation within the FlashAttention-2 forward pass.
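To make "JVP computation within the forward pass" concrete, the sketch below computes single-query softmax attention together with its tangent in one function. It shows only the underlying calculus: it does not tile the sequence or use the online-softmax trick that a real FlashAttention-2 kernel relies on, and the function names are hypothetical rather than taken from the authors' Triton code.

```python
import math

def attention_with_jvp(q, K, V, dq, dK, dV):
    """Return (o, do): o = softmax(q.K^T / sqrt(d)) V and its JVP along (dq, dK, dV)."""
    scale = 1.0 / math.sqrt(len(q))
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Scores and their tangents (product rule on q . k_j).
    s = [scale * dot(q, k) for k in K]
    ds = [scale * (dot(dq, k) + dot(q, dk)) for k, dk in zip(K, dK)]
    # Numerically stable softmax.
    m = max(s)
    e = [math.exp(x - m) for x in s]
    total = sum(e)
    p = [x / total for x in e]
    # Softmax tangent: dp_j = p_j * (ds_j - sum_i p_i * ds_i).
    mean_ds = dot(p, ds)
    dp = [pj * (dsj - mean_ds) for pj, dsj in zip(p, ds)]
    n, dim_v = len(V), len(V[0])
    o = [sum(p[j] * V[j][c] for j in range(n)) for c in range(dim_v)]
    do = [sum(dp[j] * V[j][c] + p[j] * dV[j][c] for j in range(n)) for c in range(dim_v)]
    return o, do

def attention(q, K, V):
    """Forward-only attention, obtained by passing zero tangents."""
    zq = [0.0] * len(q)
    zK = [[0.0] * len(k) for k in K]
    zV = [[0.0] * len(v) for v in V]
    return attention_with_jvp(q, K, V, zq, zK, zV)[0]

def shift(x, dx, eps):
    """x + eps * dx for a vector or a list of vectors."""
    if isinstance(x[0], list):
        return [shift(row, drow, eps) for row, drow in zip(x, dx)]
    return [a + eps * b for a, b in zip(x, dx)]

q, dq = [0.3, -0.5], [1.0, -2.0]
K = [[0.1, 0.4], [-0.2, 0.7], [0.9, -0.3]]
dK = [[0.5, 0.1], [0.0, -0.4], [0.3, 0.3]]
V = [[1.0, 0.0], [0.5, -1.0], [-0.3, 2.0]]
dV = [[0.2, 0.2], [-0.1, 0.5], [0.7, 0.0]]

o, do = attention_with_jvp(q, K, V, dq, dK, dV)

# Cross-check the tangent against a central finite difference.
eps = 1e-5
o_plus = attention(shift(q, dq, eps), shift(K, dK, eps), shift(V, dV, eps))
o_minus = attention(shift(q, dq, -eps), shift(K, dK, -eps), shift(V, dV, -eps))
fd = [(a - b) / (2 * eps) for a, b in zip(o_plus, o_minus)]
print("JVP              :", do)
print("finite difference:", fd)
```

The tangent reuses the softmax probabilities already produced by the forward pass, which is what makes fusing the two computations into one kernel attractive.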

Empirical evaluation reveals that pure sCM distillation, while producing sharp images, fails in scenarios demanding fine details or temporal consistency, leading to artifacts such as blurry textures and unstable object geometry.

Figure 1: 4-step generation results with pure sCM distillation, highlighting quality issues in fine-detail and temporal consistency.

Theoretical analysis attributes these distortions to error accumulation in the self-consistency objective, where the JVP term amplifies errors as the integration time increases, especially under limited numerical precision (BF16).
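A toy illustration of why derivative-like quantities are more fragile than plain outputs under BF16: the sketch below emulates bfloat16 rounding in pure Python and compares the error of a function value with the error of a finite-difference slope. The function `f`, the helper `to_bf16`, and the chosen numbers are assumptions made for illustration; this is not the paper's measurement protocol.

```python
import math
import struct

def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000  # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f(u):
    """Toy smooth function of time (stand-in for a time-dependent feature)."""
    return math.sin(3.0 * u)

t, dt = 0.5, 1e-4

# Error of the output value when the input and output are rounded to bfloat16.
value_exact = f(t)
value_bf16 = to_bf16(f(to_bf16(t)))
value_rel_err = abs(value_bf16 - value_exact) / abs(value_exact)

# Error of a slope estimate built from two nearby time points.
deriv_exact = (f(t + dt) - f(t)) / dt
deriv_bf16 = (to_bf16(f(to_bf16(t + dt))) - to_bf16(f(to_bf16(t)))) / dt
deriv_rel_err = abs(deriv_bf16 - deriv_exact) / abs(deriv_exact)

print(f"bf16(t) = {to_bf16(t)}, bf16(t + dt) = {to_bf16(t + dt)}")
print(f"relative error of value: {value_rel_err:.2e}")
print(f"relative error of slope: {deriv_rel_err:.2e}")
```

Near 0.5 the spacing between adjacent bfloat16 values is about 0.0039, so `t` and `t + dt` round to the same number: the value keeps sub-percent accuracy while the slope estimate is lost entirely.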

Score-Regularized Continuous-Time Consistency (rCM)

To address sCM's quality limitations, the authors propose rCM, which augments the forward-divergence-based sCM objective with a reverse-divergence score distillation term. This hybrid approach leverages the mode-covering property of sCM for diversity and the mode-seeking property of score distillation for quality.
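The mode-covering versus mode-seeking contrast can be seen in a one-dimensional toy problem: fit a single Gaussian to a two-mode target by minimizing either the forward KL (expectation under the target) or the reverse KL (expectation under the model). Everything in the sketch below (the grid, the target mixture, the candidate set) is an illustrative assumption and has no connection to the paper's actual losses or data.

```python
import math

def gaussian_pmf(xs, mu, sigma):
    """Gaussian weights on a grid, normalized to sum to 1."""
    w = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in xs]
    total = sum(w)
    return [v / total for v in w]

def kl(a, b):
    """Discrete KL(a || b) = sum_i a_i * log(a_i / b_i)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Grid on [-12, 12] and a bimodal target with modes at -3 and +3.
xs = [-12.0 + 0.05 * i for i in range(481)]
left = gaussian_pmf(xs, -3.0, 0.5)
right = gaussian_pmf(xs, 3.0, 0.5)
target = [0.5 * (l + r) for l, r in zip(left, right)]

# Candidate single-Gaussian models (mean, standard deviation).
candidates = [(mu, sigma)
              for mu in (-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)
              for sigma in (0.5, 1.0, 2.0, 3.0, 4.0)]

# Forward KL: KL(target || model). Reverse KL: KL(model || target).
forward_best = min(candidates, key=lambda c: kl(target, gaussian_pmf(xs, *c)))
reverse_best = min(candidates, key=lambda c: kl(gaussian_pmf(xs, *c), target))

print("forward KL picks (mu, sigma) =", forward_best)
print("reverse KL picks (mu, sigma) =", reverse_best)
```

The forward-KL fit is a wide Gaussian centered between the modes (it covers both but puts mass where the target has almost none), while the reverse-KL fit locks onto a single mode. This is the trade-off rCM tries to balance by combining both kinds of objective.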

Figure 2: Illustration of rCM: forward consistency propagates error, while reverse-divergence regularization stabilizes long-skip predictions.

The rCM objective is a weighted sum of sCM and DMD losses, with a fixed balancing coefficient ( $\lambda=0.01$ ) found to generalize across models and tasks. The rollout strategy for score distillation involves stochastic simulation of time steps, ensuring coverage of the entire time range and avoiding the collapse observed in fixed-step approaches.
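Written out, the weighted sum described above takes the following form. Placing $\lambda$ on the score-distillation term is an assumption based on its role as the regularizer; the internal weightings of each loss are omitted.

```latex
\mathcal{L}_{\text{rCM}}(\theta)
  = \mathcal{L}_{\text{sCM}}(\theta) + \lambda \, \mathcal{L}_{\text{DMD}}(\theta),
  \qquad \lambda = 0.01
```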

Stable JVP computation is achieved via semi-continuous time (finite difference approximation for time derivatives) for moderate-scale models, and FP32 precision enforcement for time embedding layers in large-scale/video models.
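A minimal sketch of the semi-continuous idea, under stated assumptions: the total time derivative of a network output along a trajectory splits into a spatial part (handled exactly, as a JVP would) and a time part (replaced by a finite difference). The toy function `F`, the backward-difference direction, and the default step of `1e-4` are illustrative choices, not details confirmed by the paper.

```python
import math

def F(x, t):
    """Toy stand-in for a network output F(x_t, t)."""
    return math.sin(x) * math.cos(t) + x * t

def dF_dx(x, t):
    """Exact spatial derivative; plays the role of the JVP along the x-direction."""
    return math.cos(x) * math.cos(t) + t

def dF_dt(x, t):
    """Exact partial derivative in time, used only as a reference."""
    return -math.sin(x) * math.sin(t) + x

def semi_continuous_tangent(x, t, dx_dt, delta_t=1e-4):
    """dF/dt along the trajectory: exact spatial term plus a finite-difference time term."""
    spatial = dF_dx(x, t) * dx_dt
    temporal = (F(x, t) - F(x, t - delta_t)) / delta_t  # assumed backward difference
    return spatial + temporal

x, t, dx_dt = 0.7, 0.9, -1.3
exact = dF_dx(x, t) * dx_dt + dF_dt(x, t)
approx = semi_continuous_tangent(x, t, dx_dt)
print(f"exact tangent           : {exact:.6f}")
print(f"semi-continuous tangent : {approx:.6f}")
print(f"absolute error          : {abs(exact - approx):.2e}")
```

The time term no longer requires differentiating through the time-embedding path, at the cost of a discretization error that shrinks linearly with `delta_t`.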

Experimental Results

The authors distill Cosmos-Predict2 (T2I, up to 14B) and Wan2.1 (T2V, up to 14B) models, evaluating on GenEval (T2I) and VBench (T2V) benchmarks. rCM matches or surpasses DMD2 in quality metrics, while retaining superior diversity, especially in video generation.

Figure 3: Few-step T2I samples compared to open-sourced models; rCM accurately renders fine-grained text details from prompts.

Figure 4: 5 random video samples from 4-step sCM, DMD2, and rCM on Wan2.1 1.3B; rCM resolves sCM's quality issues and outperforms DMD2 in diversity.

rCM enables high-fidelity generation in only 1–4 steps, achieving up to 50× acceleration over teacher models. For T2I, GenEval scores degrade only slightly under 1- or 2-step settings, with 1-step generations nearly indistinguishable from 4-step for simple prompts. For T2V, 2-step generations approach teacher quality, while 1-step outputs are noticeably inferior.

Figure 5: Comparison between different numbers of sampling steps; rCM maintains quality with fewer steps, especially for T2I.

Quantitative results demonstrate that rCM achieves state-of-the-art scores on GenEval and VBench, with throughput improvements proportional to the reduction in sampling steps.
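The proportionality between step count and throughput can be made explicit with a small calculation. The sketch below counts network forward passes only. The teacher configuration in the example (50 steps with classifier-free guidance) is a made-up placeholder, not a number from the paper, and the function names are hypothetical.

```python
def network_evals(steps, uses_cfg):
    """Forward passes needed: CFG runs a conditional and an unconditional pass per step."""
    return steps * (2 if uses_cfg else 1)

def ideal_speedup(teacher_steps, teacher_uses_cfg, student_steps):
    """Ratio of teacher to student forward passes (guidance is distilled into the student)."""
    return network_evals(teacher_steps, teacher_uses_cfg) / network_evals(student_steps, False)

# Hypothetical teacher: 50 sampling steps with CFG.
for student_steps in (1, 2, 4):
    ratio = ideal_speedup(50, True, student_steps)
    print(f"{student_steps}-step student: {ratio:.0f}x fewer network evaluations")
```

This ratio is an upper bound on the end-to-end gain, since stages such as text encoding and VAE decoding are not reduced by distillation.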

Implementation Details

JVP Kernel: Custom Triton kernel for FlashAttention-2, supporting both self- and cross-attention, integrated with FSDP and context parallelism.
Network Restructuring: Layers support both standard and JVP-mode forward passes, with attention blocks using the custom JVP kernel.
Training: Full-parameter tuning with AdamW, power EMA smoothing, and alternating student/fake score optimization.
Precision: FP32 enforced for time embedding in large models; semi-continuous time for moderate models.
Rollout: Stochastic simulation of time steps for score distillation, ensuring diversity and stability.
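The rollout item above can be sketched as follows. This is a loose illustration only: the uniform choice of step count, the uniform sampling of intermediate times, the TrigFlow re-noising rule, and the `toy_student` function are all assumptions introduced here, and the paper's actual distributions may differ.

```python
import math
import random

def toy_student(x_t, t):
    """Hypothetical stand-in for the few-step student f_theta(x_t, t) -> estimate of x_0."""
    return [math.cos(t) * v for v in x_t]

def stochastic_rollout(student, dim, max_steps=4, rng=random):
    """Generate a student sample with a randomly drawn number of steps and time points."""
    t_max = math.pi / 2                    # pure noise under the TrigFlow schedule
    n_steps = rng.randint(1, max_steps)    # assumed: uniform over 1..max_steps
    # Assumed: intermediate times drawn uniformly, visited from noisy to clean.
    times = sorted((rng.uniform(0.0, t_max) for _ in range(n_steps - 1)), reverse=True)
    x_t = [rng.gauss(0, 1) for _ in range(dim)]
    x0_hat = student(x_t, t_max)
    for t in times:
        noise = [rng.gauss(0, 1) for _ in range(dim)]
        # Forward-noise the current estimate back to time t, then denoise again.
        x_t = [math.cos(t) * a + math.sin(t) * b for a, b in zip(x0_hat, noise)]
        x0_hat = student(x_t, t)
    return x0_hat, [t_max] + times

rng = random.Random(0)
for _ in range(3):
    sample, visited = stochastic_rollout(toy_student, dim=3, rng=rng)
    print(len(visited), "step(s), times:", [round(t, 3) for t in visited])
```

Because the time points are redrawn on every call, repeated rollouts expose the student to the whole time range instead of a fixed handful of steps.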

Trade-offs and Limitations

Quality vs. Diversity: rCM achieves a balance, outperforming DMD2 in diversity while matching quality. Pure score distillation methods tend to collapse diversity.
Numerical Stability: JVP computation is sensitive to precision; FP32 enforcement is necessary for large models.
Scalability: Infrastructure supports models up to 14B parameters and high-dimensional video data, but further scaling may require additional engineering.

Implications and Future Directions

The integration of forward- and reverse-divergence objectives in rCM provides a unifying framework for diffusion distillation, enabling efficient, high-quality, and diverse generation in few steps. This paradigm may inspire new research in generative modeling, particularly in balancing quality and diversity without adversarial training or extensive hyperparameter tuning.

Potential future developments include:

Extending rCM to other modalities (e.g., audio, 3D)
Further optimization of JVP computation for extreme-scale models
Exploration of alternative regularization strategies for improved stability and generalization

Figure 6: Relative $L_2$ errors of the network output and JVP under BF16 precision; JVP computation incurs substantially larger numerical errors.

Conclusion

The score-regularized continuous-time consistency model (rCM) advances the state of the art in large-scale diffusion distillation, delivering competitive quality and diversity in a highly efficient framework. The proposed infrastructure and algorithmic innovations enable practical deployment of few-step generators for complex image and video tasks, with significant implications for both research and real-world applications.

Explain it Like I'm 14

What is this paper about?

This paper is about making AI image and video generators much faster without losing quality. It focuses on a kind of AI called diffusion models, which are great at making detailed, diverse pictures and videos but are slow at creating them. The authors show how to “distill” (compress and speed up) big, real-world text-to-image and text-to-video models so they can produce high-quality results in just 1–4 steps instead of dozens. They introduce a new method called rCM (score-regularized continuous-time consistency model) that improves visual quality and keeps variety, even at very high speed.

What questions did the researchers ask?

They asked:

Can we scale a promising distillation technique called continuous-time consistency (sCM) from small research models to huge, real-world image and video models?
Why does sCM sometimes produce fine-detail problems (like fuzzy text or shaky videos), and how can we fix that?
Can we combine sCM with another approach (score distillation) to get both high quality and strong diversity?
Will this work on very large models (up to 14 billion parameters) and longer videos, while staying fast?

How did they do it? (Methods explained simply)

Think of a diffusion model like a careful artist who starts with a noisy canvas and cleans it up step by step to reveal a picture. This process usually takes many steps. Distillation tries to teach a “student” artist to do the same job in just a few steps while staying accurate.

Here are the key ideas:

Consistency models (CMs) and continuous-time consistency (sCM):
- A consistency model learns a shortcut: from any point during the cleaning process, it can jump back to the clean picture in one move.
- “Continuous-time” means it doesn’t rely on fixed steps; instead, it works smoothly for any tiny time change, which avoids some numerical errors.
Why sCM struggles with quality:
- sCM tries to learn “how the picture changes over time” by measuring tiny nudges (this uses something called a JVP—Jacobian–vector product, which is like asking: “If I nudge the input a little, how does the output change?”).
- In big models and videos, these tiny-nudge calculations can be fragile. Small mistakes at early times can accumulate into bigger errors later, leading to distorted details or unstable motion.
The fix: rCM (score-regularized sCM)
- The authors add “score distillation” as a regularizer (a helpful extra rule).
- You can think of two training styles:
  - Forward style (like sCM): learns from real data or the teacher’s outputs and tends to cover many possibilities (“mode-covering”), which keeps diversity but can make samples look spread out or slightly blurry.
  - Reverse style (score distillation): learns from the student’s own samples and pushes toward the sharpest, most likely results (“mode-seeking”), which improves visual quality but can reduce diversity.
- rCM combines both: sCM keeps diversity; score distillation fixes fine details and sharpness. Together, they balance quality and variety.
Making it work at scale (infrastructure):
- Big models need special tools. The authors built:
  - A custom FlashAttention-2 JVP kernel: a fast, memory-efficient way to compute those tiny-nudge (JVP) signals inside attention layers, which are common in large models.
  - Compatibility with FSDP (splitting the model across GPUs) and context parallelism (splitting long sequences across GPUs), so training works for huge image and video models.
- They also stabilized time-related calculations:
  - Semi-continuous trick: approximate “how fast things change in time” using a tiny time difference (like a careful estimate).
  - High-precision time: compute time embeddings in higher precision to avoid wobbling or crashes in very large models.
Training setup:
- They tested on large text-to-image and text-to-video models (Cosmos-Predict2 and Wan2.1), including versions with up to 14 billion parameters and videos up to 5 seconds.
- The student model learns from the teacher, and a helper “fake score” network estimates the student’s score (like a critic telling how well the student matches the teacher).
- They used simple, steady hyperparameters (no tricky multi-stage GAN training), which makes rCM practical.
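The "tiny nudge" idea behind the JVP can be tried out with a few lines of code. The sketch below uses dual numbers, a standard way to implement forward-mode differentiation: each number carries its value plus how much it changes when the inputs are nudged. The `Dual` class and the example function `g` are illustrative and unrelated to the paper's implementation.

```python
import math

class Dual:
    """A value together with its tangent (how much it moves when the input is nudged)."""
    def __init__(self, value, tangent=0.0):
        self.value, self.tangent = value, tangent

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)
    __rmul__ = __mul__

def dual_sin(d):
    """sin with the chain rule applied to the tangent."""
    return Dual(math.sin(d.value), math.cos(d.value) * d.tangent)

def g(x, y):
    """Example function g(x, y) = x * y + 2 * sin(x)."""
    return x * y + dual_sin(x) * 2.0

# Evaluate g at (1, 2) while nudging the inputs in direction (1.0, 0.5).
out = g(Dual(1.0, 1.0), Dual(2.0, 0.5))
print("value  :", out.value)
print("tangent:", out.tangent)   # the Jacobian-vector product
```

One pass through the function yields both the output and its response to the nudge, which is the quantity sCM needs at every training step.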

What did they find, and why is it important?

Main results:

rCM fixes sCM’s fine-detail problems:
- Images: rCM renders small text and fine details much better than sCM alone.
- Videos: rCM reduces flicker and geometric distortions, improving temporal consistency.
rCM matches or beats strong baselines in quality:
- It performs on par with or better than DMD2 (a state-of-the-art distillation method) on key benchmarks:
  - GenEval for images (tests things like counting objects, color binding, positions)
  - VBench for videos (tests motion quality, clarity, and alignment with prompts)
rCM preserves and even improves diversity:
- Compared to DMD2, rCM produces more varied outputs (for example, objects don’t collapse into the same positions or orientations).
Big speedups with few steps:
- High-fidelity samples in just 1–4 steps.
- About 15×–50× faster than the teacher models.
- For text-to-image, single-step generation can be usable; for text-to-video, 2–4 steps yield high quality.

Why it matters:

Faster generation means users get results quickly, which is essential for real applications.
Keeping both high quality and diversity is hard; rCM achieves both.
The method works on very large models and real tasks, not just small research setups.

What does this mean for the future?

rCM shows that combining two training philosophies—forward (consistency) and reverse (score)—is a powerful way to distill big diffusion models. This balanced approach can make future AI generators:

Much faster, with far fewer steps
High-quality and sharp, including small text and fine textures
Diverse and creative, avoiding “mode collapse”
Scalable to huge models and long videos

In short, rCM is a practical, theory-backed path toward fast, reliable, and versatile image and video generation, and it may inspire new, unified methods that mix the best of both worlds.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, phrased to guide actionable future work.

Lack of formal theory for forward+reverse divergence coupling: no convergence guarantees, no characterization of the implicit objective optimized by rCM, and no theoretical analysis of the diversity–quality trade-off induced by the weighted sum of divergences.
Error accumulation in (s)CM is only qualitatively analyzed: no quantitative bounds on the self-feedback term from JVP, no characterization of how errors scale with time t, model size, precision, or number of steps, and no principled regularization schedules to mitigate it.
Sensitivity to key hyperparameters is untested: no ablations on the balancing weight λ, the time distributions p_G and p_D, or the rollout step distribution N; unclear whether adaptive or curriculum schedules outperform the fixed λ=0.01.
Limited exploration of reverse-divergence variants: the choice of DMD over SiD is justified by memory considerations but lacks a systematic comparison; it is unknown when SiD or other reverse divergences (e.g., Fisher, Jensen–Shannon) might yield better stability or quality at scale.
Fake score network design and dynamics remain under-explored: no ablations on its capacity, architecture, update frequency, teacher-initialization dependency, or the impact of student distribution shift; stability properties of the adversarial interplay (student vs. fake score) are not analyzed.
Numerical stability of time-derivative computation is ad hoc: semi-continuous finite differences (Δt=1e−4) and FP32-only time embeddings work empirically, but their error profiles, stability regions, and performance–efficiency trade-offs are not quantified, especially for >14B models or longer videos.
JVP infrastructure portability is unclear: the custom FlashAttention-2 JVP kernel, FSDP, and CP integration are not evaluated on diverse hardware/backends (e.g., AMD/TPU), different attention kernels (e.g., FlashAttention-3), or alternative parallelization strategies; overheads vs. PyTorch jvp are not reported.
Dependence on BF16 and mixed-precision tricks is not rigorously evaluated: the impact of precision choice on JVP error, training stability, and quality is not systematically measured across model scales and tasks.
Generalization across teachers and parameterizations is untested: wrapping teachers from different schedules/architectures (e.g., SDXL, FLUX, SD3.5) is presented as theoretically valid, but empirical robustness of SNR mapping, precision choices (FP64 wrapping), and CFG conversion is not assessed.
CFG handling is narrow: the method distills a single CFG setting; it is unknown whether rCM supports arbitrary guidance at inference, how performance varies across CFG scales, or how to train for CFG-adjustable students.
Limited scope of evaluation: no quantitative diversity metrics (e.g., precision/recall, coverage, intra-prompt variation) accompany the diversity claims; no user studies; GenEval/VBench do not capture safety, photorealistic fidelity, or long-range temporal consistency comprehensively.
Video scope is restricted: results are limited to 480p/720p and ∼5 s; scalability to 1080p/4K, longer durations, variable frame rates, and more complex motion remains untested; physical and causal consistency metrics are not reported.
1-step T2V generation remains weak: the paper shows clear quality drops at 1 step but does not explore remedies (e.g., stronger long-skip regularizers, curriculum sampling, trajectory consistency models) to close the 1→2 step gap.
Comparisons to related consistency/trajectory methods are incomplete: no large-scale head-to-head results vs. MeanFlow, AYF, multistep CTMs, or Hyper-SD under the same teachers, data, and compute budgets.
Data dependence and synthetic-only training are unverified: although the paper claims rCM could train solely on teacher-generated data, this is not empirically demonstrated; the effect of real vs. synthetic data mixes on quality/diversity is unknown.
Compute and efficiency reporting is incomplete: throughput is reported for inference, but training compute (GPU hours), memory footprint vs. DMD2, and scaling laws of convergence speed/quality with model size are not provided.
Robustness to prompt distribution shifts is untested: performance on out-of-distribution prompts, multilingual text rendering, long/complex instructions, numeracy, fine-print text across fonts and languages, and adversarial prompts is unknown.
Multi-modality and controllability are not addressed: extension to audio, 3D/multi-view, control signals (poses, depth, sketches), or multi-constraint conditioning is not explored.
VAE interaction is unstudied: whether VAE bottlenecks limit fine-detail fidelity under few-step rCM, and whether latent-space consistency vs. pixel-space choices affect quality/diversity, are not analyzed.
Sampling procedure design space is narrow: the paper uses alternating reverse-denoise/forward-noise steps; alternatives (e.g., learned samplers, trajectory models, solver-informed CM steps) and their effects on error accumulation and fidelity are unexplored.
Stability over long training is fragile: while the paper proposes time-derivative fixes to prevent late-stage collapse, failure modes, early-warning signals, and principled stabilization strategies (e.g., spectral norm, gradient clipping targeted to JVP paths) are not systematically investigated.
Porting to resource-limited regimes is unclear: rCM is validated with full-parameter tuning; behavior under LoRA/partial fine-tuning, smaller batch sizes, or lower-precision hardware is not evaluated.
Safety, bias, and memorization are not assessed: no measurements of harmful content generation, demographic bias, or training data leakage, especially when mixing real and synthetic data.
Long-skip regularization schedule is fixed: no exploration of time-dependent λ(t), selective application to large t, or adaptive balancing based on JVP magnitude/error proxies to better target sCM’s error accumulation regime.
Interaction with adversarial objectives is underexplored: while rCM avoids explicit GAN tuning, it is unknown whether combining rCM with lightweight adversarial heads (as in DMD2) could further improve fine details without sacrificing diversity.
Practical reproducibility gaps: the paper does not detail whether the custom JVP kernels and training recipes (precision scopes, CP configurations, checkpointing policies) are fully released and portable, which affects the community’s ability to validate and extend the results.

Glossary

AdaGN: A conditional normalization layer (Adaptive Group Normalization) used in some diffusion architectures; noted here for instability with certain time embeddings. "As our concerned models do not involve the unstable Fourier time embedding or AdaGN layers mentioned in sCM, but instead adopt positional time embedding, AdaLN, and QK normalization, we keep the network structure."
AdaLN: Adaptive Layer Normalization that modulates features based on conditioning signals in large diffusion models. "As our concerned models do not involve the unstable Fourier time embedding or AdaGN layers mentioned in sCM, but instead adopt positional time embedding, AdaLN, and QK normalization, we keep the network structure."
Adversarial distillation: A distillation approach that adds adversarial (GAN-like) objectives to train few-step generators. "score distillation~\citep{wang2023prolificdreamer,luo2023diff,yin2024one,yin2024improved,salimans2024multistep,zhou2024score} and adversarial distillation~\citep{sauer2024adversarial,sauer2024fast,lin2024sdxl,lin2025diffusion}."
All-to-all operation: A distributed communication primitive that exchanges tensor slices among GPUs, crucial for context/sequence parallel attention. "An all-to-all operation then redistributes QKV to [B, H/P, L, C] for local attention, followed by another all-to-all to restore the sequence partition."
BF16 precision: A 16‑bit floating-point format (bfloat16) used to reduce memory and improve throughput in large-model training. "Modern large-model training typically relies on infrastructures such as BF16 precision, FlashAttention and context parallelism (CP), which complicate and incur numerical errors in sCM's Jacobian–vector product (JVP) computation."
Classifier-free guidance (CFG): A technique that adjusts conditional generation by mixing conditional and unconditional predictions to strengthen prompt alignment. "The teacher denoiser employs classifier-free guidance (CFG)~\citep{ho2022classifier}, which is simultaneously distilled into the student."
Consistency models (CMs): Generative models that learn a mapping from any diffusion time t to the initial data, enabling few-step or one-step sampling. "Consistency models (CMs)~\citep{song2023consistency} aim to learn a consistency function $f_\theta: (x_t, t) \mapsto x_0$ ..."
Consistency trajectory models (CTM): Methods that learn to jump between any two times along a probability flow ODE trajectory, rather than only to its endpoint, to improve few-step generation. "consistency trajectory models (CTM)~\citep{kim2023consistency}"
Context parallelism (CP): A parallelization strategy that partitions the sequence dimension across GPUs to handle long inputs efficiently. "Context (or sequence) parallelism partitions the input tensor of shape [B, H, L, C] across P GPUs along the sequence dimension L..."
Denoiser: The diffusion network that predicts clean data, noise, or velocity; in this paper, the teacher/student model used for guidance and distillation. "The teacher denoiser employs classifier-free guidance (CFG)~\citep{ho2022classifier}, which is simultaneously distilled into the student."
Distribution Matching Distillation (DMD): A reverse‑KL‑based distillation objective that matches the student’s diffused distribution to the teacher’s. "variational score distillation (VSD)~\citep{wang2023prolificdreamer,luo2023diff} considers the reverse KL divergence ( $f(r)=-\log r$ ), also known as distribution matching distillation (DMD)~\citep{yin2024one};"
DMD2: An improved DMD variant that incorporates additional adversarial training and practical tricks for higher-quality few-step synthesis. "Currently, score- and adversarial-distillation methods, such as DMD2~\citep{yin2024improved}, remain the state of the art for large-scale diffusion distillation."
Fake score network: An auxiliary diffusion model trained on student outputs to approximate the intractable student score for reverse-divergence losses. "As the student score $\nabla_{x_t}\log p_\theta^t(x_t)$ is intractable for the few-step generator $x_\theta$ , an auxiliary fake score network is introduced."
Fisher divergence: A discrepancy measure based on gradients of log-density ratios, used in score-based distillation (SiD). "the more recent score identity distillation (SiD)~\citep{zhou2024score} considers the Fisher divergence $f(r)=\|\nabla_{x_t}\log r\|_2^2$ ."
FlashAttention-2: A fused, memory‑efficient attention kernel widely used to scale transformer training. "FlashAttention-2~\citep{dao2023flashattention} is widely used in large-scale training to reduce memory cost and improve throughput."
Flow matching: A training framework that regresses a velocity field $v_\theta$ so that integrating the ODE $\frac{d x_t}{d t}=v_\theta(x_t,t)$ transports noise to data. "the PF-ODE is simplified to $\frac{d x_t}{d t}=v_\theta(x_t,t)$ , commonly known as flow matching~\citep{lipman2022flow}."
Forward divergence: A divergence minimized on real or teacher samples that penalizes underestimation of likelihoods, typically encouraging mode covering. "Despite the theoretical existence of forward divergence, GANs in practice still suffer from limited diversity and model collapse."
Forward-mode automatic differentiation: An AD mode that propagates tangent vectors to compute Jacobian–vector products efficiently. "can be computed using forward-mode automatic differentiation, Jacobian-vector product (JVP)."
Fully Sharded Data Parallel (FSDP): A training strategy that shards model parameters across GPUs to reduce memory usage. "Fully Sharded Data Parallel (FSDP)~\citep{zhao2023pytorch} reduces the memory footprint by partitioning models across GPUs..."
Generative adversarial networks (GANs): Generative models trained via adversarial competition between a generator and discriminator; noted for diversity issues in this context. "generative counterparts like generative adversarial networks (GANs)~\citep{goodfellow2014generative}, albeit suffering from slow inference."
GenEval: A benchmark for evaluating compositional text‑to‑image generation (e.g., counting, spatial relations). "We use GenEval~\citep{ghosh2023geneval} to evaluate T2I models on complex compositional prompts, such as object counting, spatial relations, and attribute binding."
Jacobian–vector product (JVP): The product of a Jacobian matrix with a vector; used to compute time derivatives and tangents in sCM via forward-mode AD. "While JVP can be computed with PyTorch's built-in forward-mode operator torch.func.jvp, it is not natively compatible with large-scale training setups..."
Likelihood ratio: The ratio of teacher to student densities, a core quantity in f‑divergence distillation objectives. "where $r_{p_{\text{teacher}}^t,p_\theta^t}(x_t)=\frac{p_{\text{teacher}}^t(x_t)}{p_\theta^t(x_t)}$ is the likelihood ratio."
MeanFlow: A one‑step generative modeling approach arising from combining sCM with consistency trajectory models. "When combined with consistency trajectory models~\citep{kim2023consistency,heek2024multistep}, sCM further gives rise to the popular MeanFlow~\citep{geng2025mean}."
Mode-covering: A property of forward divergences that spreads probability mass to cover all modes, often at the expense of sample sharpness. "Forward divergence is known to encourage ``mode-covering'' by penalizing underestimation of any training sample likelihoods..."
Mode-seeking: A property of reverse divergences that concentrates probability mass on high-density regions, improving visual quality but reducing diversity. "In contrast, reverse divergence is inherently ``mode-seeking'' and beneficial to the visual quality of diffusion models..."
Non-saturating GAN loss: A generator objective that maximizes $\log D(G(z))$ instead of minimizing $\log(1 - D(G(z)))$, which avoids vanishing generator gradients when the discriminator is confident. "incorporating the non-saturating GAN loss to supplement DMD training."
Probability flow ordinary differential equation (PF-ODE): The deterministic ODE whose trajectories follow the marginal distributions of the diffusion process. "The sampling process of diffusion models can follow the probability flow ordinary differential equation (PF-ODE) $d x_t = \left[f(t)x_t - \frac{1}{2}g^2(t)\nabla_{x_t} \log q_t(x_t)\right] d t$ ..."
QK normalization: Normalization applied to query/key tensors in attention to stabilize training. "but instead adopt positional time embedding, AdaLN, and QK normalization, we keep the network structure."
Rectified flow: A special linear schedule ( $\alpha_t=1-t,\sigma_t=t$ ) that simplifies training and sampling. "A notable special case, rectified flow~\citep{liu2022flow}, employs the schedule $\alpha_t=1-t,\sigma_t=t$ ."
Reverse KL divergence: The KL divergence from the student to the teacher, $\mathrm{KL}(p_\theta \,\|\, p_{\text{teacher}})$, whose expectation is taken over student samples; used by VSD/DMD. "variational score distillation (VSD)~\citep{wang2023prolificdreamer,luo2023diff} considers the reverse KL divergence ( $f(r)=-\log r$ )..."
Reverse divergence: A divergence minimized on student-generated samples, encouraging mode seeking and sharper outputs. "In contrast, reverse divergence is inherently ``mode-seeking'' and beneficial to the visual quality of diffusion models..."
Score distillation: Methods that use score functions (gradients of log-density) to match student and teacher diffused distributions. "Score distillation methods aim to match the student distribution $p_\theta$ with the teacher distribution $p_{\text{teacher}}$ ..."
Score Identity Distillation (SiD): A score‑based distillation approach that leverages the Fisher divergence identity for one-step generation. "the more recent score identity distillation (SiD)~\citep{zhou2024score} considers the Fisher divergence..."
Selective activation checkpointing (SAC): A memory‑saving training technique that checkpoints selected activations to trade compute for memory. "with infrastructure support from FSDP2, Ulysses CP, and selective activation checkpointing (SAC)."
Signal-to-noise ratio (SNR): The ratio of signal to noise in the forward process; matched to align time schedules across parameterizations. "by matching the signal-to-noise ratio, i.e., by solving $\frac{\sigma_{t^\text{raw}}}{\alpha_{t^\text{raw}}} = \tan(t)$."
Stop-gradient: An operator that prevents gradients from flowing through a tensor during backpropagation. "where $v_\text{fake}$ is the denoiser of the fake score network, $p_D$ is a time distribution and $sg$ is the stop-gradient operator."
TrigFlow: A trigonometric noise schedule ( $\alpha_t=\cos(t),\sigma_t=\sin(t)$ ) and preconditioning used by sCM for continuous-time consistency. "sCM employs the TrigFlow noise schedule $\alpha_t=\cos(t),\sigma_t=\sin(t)$ and preconditioning $c_\text{skip}(t)=\cos(t),c_\text{out}(t)=-\sin(t)$ ..."
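Ulysses strategy: A context parallelism scheme (DeepSpeed Ulysses) that redistributes QKV across GPUs with all-to-all communication, switching from sequence sharding to head sharding so that each GPU runs attention over the full sequence for a subset of heads. "In the Ulysses~\citep{jacobs2023deepspeed} strategy, each GPU first holds a slice of size [B, H, L/P, C] for QKV."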
Variational Autoencoder (VAE): A generative model whose decoder maps latents back to pixels in the video pipeline; its decoding time is included in the reported throughput measurements. "covering both diffusion sampling and VAE decoding stages."
Variational Score Distillation (VSD): A score‑based distillation framework that minimizes reverse KL divergence between diffused student and teacher distributions. "variational score distillation (VSD)~\citep{wang2023prolificdreamer,luo2023diff} considers the reverse KL divergence..."
Velocity parameterization: A diffusion parameterization where the network predicts the velocity along the PF‑ODE rather than noise or score. "With velocity parameterization $v_\theta$ , diffusion models are trained by minimizing the mean square error (MSE)..."
VBench: A comprehensive benchmark for text‑to‑video quality and semantic alignment. "For video generation, we adopt VBench~\citep{huang2024vbench} to systematically assess motion quality and semantic alignment."
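
To make the SNR-matching entry concrete, here is a minimal sketch for the rectified-flow case. It maps a rectified-flow time $t^\text{raw}$ to the TrigFlow time $t$ by solving $\sigma_{t^\text{raw}}/\alpha_{t^\text{raw}} = \tan(t)$ with $\alpha_{t^\text{raw}}=1-t^\text{raw}$ and $\sigma_{t^\text{raw}}=t^\text{raw}$. The function names are hypothetical and are not taken from the paper's code; the rescaling factor is an assumption that follows from requiring the rescaled coefficients to equal $\cos(t)$ and $\sin(t)$.

```python
import math


def rf_to_trigflow_time(t_raw: float) -> float:
    """Map rectified-flow time t_raw in [0, 1] to TrigFlow time t in [0, pi/2].

    Solves sigma_raw / alpha_raw = tan(t) with alpha_raw = 1 - t_raw, sigma_raw = t_raw.
    atan2 handles the endpoint t_raw = 1, where alpha_raw = 0.
    """
    return math.atan2(t_raw, 1.0 - t_raw)


def trigflow_to_rf_time(t: float) -> float:
    """Inverse map: t_raw / (1 - t_raw) = tan(t)  =>  t_raw = sin(t) / (sin(t) + cos(t))."""
    s, c = math.sin(t), math.cos(t)
    return s / (s + c)


def rf_to_trigflow_scale(t_raw: float) -> float:
    """Factor that rescales (alpha_raw, sigma_raw) onto the unit circle (assumed convention).

    After multiplying by this factor, the coefficients become (cos t, sin t).
    """
    return 1.0 / math.hypot(1.0 - t_raw, t_raw)


# Hypothetical usage: the midpoint of rectified flow lands at t = pi/4.
t_raw = 0.5
t = rf_to_trigflow_time(t_raw)
scale = rf_to_trigflow_scale(t_raw)
alpha_trig = (1.0 - t_raw) * scale  # should equal cos(t)
sigma_trig = t_raw * scale          # should equal sin(t)
```

The two schedules describe the same noisy marginals up to this time change and rescaling, which is why a teacher trained with one schedule can be wrapped for use with the other.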


Practical Applications

Practical Applications of Score-Regularized Continuous-Time Consistency (rCM)

Below are actionable applications derived from the paper’s findings, methods, and infrastructure innovations. Each item is tagged with relevant sectors and includes assumptions or dependencies that may affect feasibility.

Immediate Applications

These can be deployed now with current models and tooling.

Accelerated text-to-image and text-to-video inference for production creative pipelines (media/entertainment, advertising, design)
- Use rCM-distilled models (1–4 steps) to render high-quality visuals and 2–5s videos with 15×–50× lower latency and cost, enabling rapid iteration, A/B testing, and campaign personalization at scale.
- Tools/products/workflows: “Turbo” T2I/T2V endpoints in cloud inference; batch generation services with diversity-aware sampling; rCM-backed creative assistants.
- Assumptions/Dependencies: Access to teacher model weights (licensing), GPU inference stack, alignment with existing sampler/CFG setups.
Real-time interactive creative tools for end-users (software, consumer apps)
- Enable near-instant image/video previews in editors and chat assistants; live prompt-tuning with responsive updates (e.g., Canva/Adobe plugins, social media content apps).
- Assumptions/Dependencies: Integration into existing UX, on-demand GPU capacity, prompt safety filters.
On-device generative media for constrained hardware (mobile, XR, edge)
- Few-step generation preserves visual quality and diversity, making short video and high-res image synthesis feasible on laptops and high-end phones and for AR/VR content.
- Tools/products/workflows: Mobile SDKs with rCM distillation targets; lightweight diffusion runtimes with CP/FSDP-compatible kernels.
- Assumptions/Dependencies: Efficient quantization, memory budgets, device GPU/NPU support; thermal limits.
Synthetic data generation at scale with preserved diversity (autonomy, robotics, vision)
- Rapidly produce varied scenes, textures, and realistic text-in-scene assets for OCR, detection, tracking, and visuomotor pretraining.
- Assumptions/Dependencies: Domain-specific prompt curation; control signal integration (layout/conditioning), dataset governance.
Faster content prototyping with fine-detail text rendering (design, marketing)
- rCM fixes sCM’s small-text failures; reliably render typography (e.g., signage, watches, packaging) for creative mockups.
- Assumptions/Dependencies: High-quality teacher initialization; prompt engineering; CFG distillation alignment.
Video-first use cases needing diversity without GAN collapse (advertising, entertainment)
- rCM maintains sCM’s diversity while matching or beating DMD2’s quality, beneficial for generating multiple distinct shots/angles/arrangements from one prompt.
- Assumptions/Dependencies: Diversity-aware sampling strategy; consistent content safety review.
Cost and energy savings in large-scale inference (cloud, sustainability)
- Replace 50–100-step samplers with 1–4 steps to reduce GPU-hours and CO₂ footprint, improving SLA adherence and cost per asset.
- Assumptions/Dependencies: Operational monitoring; acceptance of small quality deltas on edge cases.
Infrastructure adoption: FlashAttention-2 JVP kernel and JVP-compatible parallelism (ML infrastructure)
- Immediate use in training pipelines for large models (10B+) with FSDP and context parallelism; unlocks continuous-time consistency training at scale.
- Tools/products/workflows: Triton-based FlashAttention-2 JVP module; PyTorch layer-level JVP refactors; Ulysses CP-compatible attention JVP.
- Assumptions/Dependencies: NVIDIA GPU stack, Triton, BF16/FP32 precision; engineering resources to integrate kernels.
Academic benchmarking and baselining (academia)
- rCM offers a strong and stable few-step baseline for T2I/T2V with documented metrics (GenEval, VBench), enabling reproducible comparisons of divergence-based distillation strategies.
- Assumptions/Dependencies: Access to models and prompts; consistent evaluation protocols.
Enterprise-grade distillation workflows (software, MLOps)
- One-pass integration of forward-divergence sCM and reverse-divergence DMD (λ≈0.01 works broadly) without multi-stage GAN tuning; distill CFG concurrently.
- Tools/products/workflows: Distillation pipelines with student + fake-score training loops; semi-continuous time derivative or FP32 time embeddings for stability.
- Assumptions/Dependencies: Teacher weights, training compute (multi-GPU), data sources (real or synthetic), licensing.
Education and training (education, daily life)
- Classroom demos and tutorials on divergence types (mode-covering vs mode-seeking) and their practical implications; student projects on building fast generative tools.
- Assumptions/Dependencies: Simplified code releases/colabs; moderate GPUs.
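
For the classroom-demo use case, the following is a minimal sketch of the mode-covering versus mode-seeking distinction. It is a toy illustration and not the paper's method: a single Gaussian is fitted to a two-mode target on a 1D grid by brute-force search, once with the forward KL (target to model) and once with the reverse KL (model to target). The target mixture, the grid, and the candidate parameters are all assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, s):
    """Density of a Gaussian with mean mu and standard deviation s."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))


def normalize(ws):
    """Turn non-negative weights into a discrete distribution on the grid."""
    total = sum(ws)
    return [w / total for w in ws]


def kl(a, b):
    """Discrete KL(a || b); terms with a_i = 0 contribute nothing."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


# Grid on [-8, 8] and a bimodal target with modes at -2 and +2 (illustrative choice).
xs = [-8.0 + 0.05 * i for i in range(321)]
p = normalize([0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5) for x in xs])


def fit(direction):
    """Grid-search the single Gaussian minimizing forward or reverse KL."""
    best = None
    for i in range(13):
        mu = -3.0 + 0.5 * i
        for s in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
            q = normalize([normal_pdf(x, mu, s) for x in xs])
            loss = kl(p, q) if direction == "forward" else kl(q, p)
            if best is None or loss < best[0]:
                best = (loss, mu, s)
    return best[1], best[2]


forward_fit = fit("forward")  # wide Gaussian centered between the modes
reverse_fit = fit("reverse")  # narrow Gaussian sitting on a single mode
print("forward KL fit (mu, sigma):", forward_fit)
print("reverse KL fit (mu, sigma):", reverse_fit)
```

The forward fit spreads mass over both modes, including the low-density gap between them, while the reverse fit commits to one mode and ignores the other. This mirrors the quality and diversity trade-off that motivates combining the two divergences.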

Long-Term Applications

These will benefit from further research, scaling, or development.

Real-time, long-duration video synthesis at higher resolutions (media/entertainment, XR)
- Move from 2–5s to minutes-long coherent videos with sharp text and stable geometry; live generative cinematography and virtual production.
- Tools/products/workflows: Streaming inference engines with rCM; editor-integrated real-time shot generation.
- Assumptions/Dependencies: Stability for long timescales, temporal consistency controls, memory-efficient attention over long sequences.
On-device, privacy-preserving generative assistants (consumer, enterprise)
- Local distillation of proprietary teachers into compact students for offline content creation, preserving user privacy and reducing cloud dependency.
- Assumptions/Dependencies: Efficient distillation to small architectures, edge accelerators, robust safety filters.
Multimodal distillation across audio, 3D, and interactive agents (multimodal AI, robotics)
- Extend rCM’s forward/reverse divergence synergy to speech, music, 3D scene/video generation, and embodied simulation.
- Assumptions/Dependencies: New parameterizations and schedules per modality, JVP kernels for non-visual attention blocks.
Sustainable AI policy and reporting (policy, sustainability)
- Standardize efficiency KPIs for generative systems (steps, FPS, CO₂ per asset) and incentivize few-step distillation practices via procurement or regulation.
- Assumptions/Dependencies: Industry-wide benchmarks, auditable measurement protocols, stakeholder buy-in.
Safety and alignment distillation (trust & safety)
- Co-distill safety filters, content rules, and watermarking into few-step students while preserving diversity; reduce mode collapse-induced bias.
- Assumptions/Dependencies: High-quality safety datasets, robust reverse-divergence regularizers, policy harmonization.
Domain-specialized, controllable generation (enterprise verticals)
- rCM-based distillation for CAD, medical illustration/education, GIS/remote sensing overlays, and scientific visualization; controllable layouts and attributes with minimal latency.
- Assumptions/Dependencies: Domain-conditioned datasets, controllable diffusion interfaces, regulatory guardrails (e.g., healthcare).
Federated and collaborative distillation (MLOps)
- Distill local students per data silo (enterprise teams or regions), aggregate via weight averaging or teacher ensembles, reducing central compute.
- Assumptions/Dependencies: Robust federated protocols, compatible teacher wrappers across schedules, privacy/consent frameworks.
Framework and ecosystem integration (ML infrastructure)
- Standardized JVP APIs in major libraries (PyTorch, JAX), production-ready FlashAttention-2 JVP kernels, and CP recipes, making continuous-time consistency mainstream.
- Assumptions/Dependencies: Community maintenance, cross-hardware support, kernel portability.
New evaluation standards emphasizing diversity and fine detail (academia, standards)
- Benchmarks that capture small-text fidelity, temporal stability, and diversity (mode coverage) beyond FID; guide algorithm design and deployment thresholds.
- Assumptions/Dependencies: Broad community adoption, reliable automatic metrics, datasets with granular annotations.
Unified distillation paradigms across generators (research)
- General theory and tooling for combining forward and reverse divergences to balance quality and diversity, applicable to diffusion, flow-matching, and autoregressive models.
- Assumptions/Dependencies: Theoretical advances, stable optimization recipes, cross-model initialization strategies.

Notes on Assumptions and Dependencies

Access and licensing: Many applications assume legal access to strong teacher models (e.g., Cosmos-Predict2, Wan2.1) and their parameterizations; policy and IP constraints may apply.
Compute requirements: Distillation is training-intensive; large-scale students (up to 14B) require multi-GPU clusters with FSDP/CP; inference is lightweight post-distillation.
Numerical stability: Training stability relies on a semi-continuous time derivative or FP32 time embeddings; mixed BF16/FP32 precision and JVP precision need careful handling.
Kernel/infrastructure availability: Adoption of Triton-based FlashAttention-2 JVP kernels and layer-level JVP refactors is critical; portability to non-NVIDIA stacks may need engineering.
Generalization: The λ≈0.01 balance and time sampling strategies work across reported models/tasks but may need adjustment for new domains/modalities.
Safety and governance: Fast generation amplifies throughput; responsible deployment requires robust content filtering, watermarking, and monitoring.
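
The numerical-stability note can be illustrated with a small, self-contained sketch. It emulates BF16 rounding in pure Python (an emulation written for this illustration, not the paper's code) to show how coarse BF16 is over the TrigFlow time range $[0, \pi/2]$. BF16 keeps 8 significant bits, so near $t \approx 1.2$ two times that differ by $10^{-3}$ collapse to the same value, which is one plausible reason to keep time inputs in FP32. The helper name `to_bf16` and the chosen values of `t` and `dt` are assumptions.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even).

    BF16 is the top 16 bits of an IEEE float32: 1 sign, 8 exponent, 7 mantissa bits.
    NaN and infinity are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                  # round-to-nearest-even bias
    bits &= 0xFFFF0000                                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Two nearby times inside the TrigFlow range [0, pi/2] (illustrative values).
t, dt = 1.2, 1e-3
t_bf16 = to_bf16(t)
t_next_bf16 = to_bf16(t + dt)

# In [1, 2) adjacent BF16 values are 2**-7 apart, far larger than dt.
collapsed = t_bf16 == t_next_bf16
print("bf16(t) =", t_bf16, "bf16(t + dt) =", t_next_bf16, "collapsed:", collapsed)
```

A finite difference in time computed from BF16 inputs would be exactly zero here, so any quantity that depends on small time increments needs higher precision for the time variable.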


Open Problems

Applicability of continuous-time consistency models (sCM) to large-scale text-to-image and video diffusion

