
DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation (2506.03123v1)

Published 3 Jun 2025 in cs.CV

Abstract: Diffusion Models have achieved remarkable results in video synthesis but require iterative denoising steps, leading to substantial computational overhead. Consistency Models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of Consistency Models, we identify a key conflicting learning dynamic during the distillation process: there is a significant discrepancy in the optimization gradients and loss contributions across different timesteps. This discrepancy prevents the distilled student model from achieving an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce Temporal Coherence Loss to improve motion consistency for the semantic expert and apply GAN and Feature Matching Loss to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.

Summary

  • The paper presents a dual-expert framework that decouples semantic layout and fine-detail refinement to address temporal inconsistencies in video synthesis.
  • It employs parameter-efficient techniques, including Low-rank Adaptation and timestep-dependent layers, to optimize the denoising process while reducing computational overhead.
  • The model integrates Temporal Coherence, GAN, and Feature Matching Losses to enhance motion consistency and visual quality with fewer sampling steps.

Dual-Expert Consistency Model for Video Generation

The paper "DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation" presents an approach to accelerating video synthesis with diffusion models. The authors address a key limitation of existing diffusion models: the need for iterative denoising steps, which is computationally intensive. Consistency models, when applied directly to video diffusion models, suffer degraded temporal consistency and appearance detail, primarily due to discrepancies in optimization gradients and loss contributions across different timesteps.

To mitigate these issues, the paper introduces the Dual-Expert Consistency Model (DCM) framework. The core idea is to decouple the learning process into two specialized "experts": a semantic expert, which learns semantic layout and motion, and a detail expert, which handles fine-detail refinement. The authors additionally propose a Temporal Coherence Loss to maintain motion consistency and apply GAN and Feature Matching losses to enhance synthesis quality.
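To make the routing concrete, here is a minimal PyTorch sketch of how such a dual-expert denoiser might dispatch on the timestep. The boundary value `t_split`, the module names, and the call signature are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a timestep-based dual-expert dispatch (not the authors' code).
import torch
import torch.nn as nn

class DualExpertDenoiser(nn.Module):
    def __init__(self, semantic_expert: nn.Module, detail_expert: nn.Module,
                 t_split: float = 0.5):
        super().__init__()
        self.semantic_expert = semantic_expert  # owns early, high-noise steps: layout and motion
        self.detail_expert = detail_expert      # owns late, low-noise steps: appearance detail
        self.t_split = t_split                  # hypothetical boundary on normalized time in [0, 1]

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Route each sample to the expert that owns its phase of the ODE trajectory.
        # For brevity this sketch evaluates both experts and selects per sample;
        # a real implementation would call only the active expert.
        use_semantic = (t >= self.t_split).view(-1, *([1] * (x_t.dim() - 1)))
        return torch.where(use_semantic,
                           self.semantic_expert(x_t, t),
                           self.detail_expert(x_t, t))
```

In the paper's setting the two experts are obtained parameter-efficiently (e.g., via LoRA on a shared backbone); the sketch above only illustrates the routing itself.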

Methodology and Key Contributions

The paper analyzes the training dynamics of consistency models to identify discrepancies in loss contributions and optimization gradients across timesteps. This analysis forms the basis for the proposed Dual-Expert framework, which trains two denoisers tailored to different stages of the synthesis process. The parameter-efficient design retains visual quality while minimizing computational overhead.
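One way to surface such a discrepancy in practice is to bucket distillation losses by timestep during training and compare per-bin magnitudes. The snippet below is a generic diagnostic sketch, not the paper's analysis code; the normalized-time assumption and bin count are arbitrary.

```python
# Generic diagnostic: bucket per-step distillation losses by normalized
# timestep to expose uneven loss contributions across the trajectory.
import collections
import torch

loss_by_bin = collections.defaultdict(list)

def log_distillation_loss(t: torch.Tensor, loss: torch.Tensor, num_bins: int = 10):
    # t is assumed normalized to [0, 1]; record the scalar loss in its timestep bin.
    b = min(int(t.float().mean().item() * num_bins), num_bins - 1)
    loss_by_bin[b].append(loss.detach().item())

# After training, strongly uneven mean losses across bins indicate the
# timestep-dependent conflict the paper describes.
mean_loss = {b: sum(v) / len(v) for b, v in sorted(loss_by_bin.items())}
```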

  1. Decoupled Training: The Ordinary Differential Equation (ODE) trajectory is segmented into semantic and detail phases, and a distinct expert denoiser is optimized for each phase of the video synthesis process.
  2. Parameter-Efficient Design: Low-rank Adaptation (LoRA) and additional timestep-dependent layers keep the design parameter-efficient without sacrificing performance, so the dual-expert system runs with minimal extra computational cost.
  3. Expert-Specific Loss Functions: Temporal Coherence Loss enforces motion continuity for the semantic expert, while GAN and Feature Matching Losses improve detail quality for the detail expert, aligning the output distribution more closely with the teacher model (see the sketch after this list).
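As a rough illustration of the expert-specific losses in item 3, the sketch below pairs a frame-difference temporal coherence term with a discriminator feature-matching term. The exact formulations in the paper may differ; the (batch, frames, channels, height, width) tensor layout and the L1 distance are assumptions.

```python
# Hedged sketches of the expert-specific losses; exact forms may differ from the paper.
import torch
import torch.nn.functional as F

def temporal_coherence_loss(student_video: torch.Tensor,
                            teacher_video: torch.Tensor) -> torch.Tensor:
    # Penalize mismatched frame-to-frame motion by comparing the temporal
    # differences of student and teacher predictions.
    # Assumed layout: (batch, frames, channels, height, width).
    d_student = student_video[:, 1:] - student_video[:, :-1]
    d_teacher = teacher_video[:, 1:] - teacher_video[:, :-1]
    return F.l1_loss(d_student, d_teacher)

def feature_matching_loss(student_feats: list, teacher_feats: list) -> torch.Tensor:
    # Match intermediate discriminator features of the student's output to
    # those of the teacher's output, layer by layer.
    losses = [F.l1_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats)]
    return torch.stack(losses).mean()
```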

Empirical Results

The proposed model is benchmarked against state-of-the-art baselines such as LCM and PCM on the HunyuanVideo and CogVideoX frameworks, evaluated with VBench. With only 4 sampling steps, DCM attains VBench scores comparable to the base models at 50 steps and significantly outperforms 4-step alternatives like LCM and PCM in both semantic alignment and visual quality metrics, at similar sampling latency.
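For intuition on the 4-step regime, here is a minimal multistep consistency sampling loop. The timestep schedule, the sigma(t) = t re-noising rule, and the model signature (matching the dispatch sketch above) are assumptions for illustration, not the paper's sampler.

```python
# Illustrative 4-step multistep consistency sampling (schedule and
# re-noising rule are assumptions, not the paper's exact sampler).
import torch

@torch.no_grad()
def sample_dcm(model, shape, timesteps=(1.0, 0.75, 0.5, 0.25), device="cpu"):
    x = torch.randn(shape, device=device)  # start from pure noise at t = 1
    for i, t in enumerate(timesteps):
        t_batch = torch.full((shape[0],), t, device=device)
        # Consistency mapping x_t -> x_0; the model dispatches internally
        # to the semantic or detail expert based on t.
        x0 = model(x, t_batch)
        if i + 1 < len(timesteps):
            # Re-noise the clean estimate to the next timestep (assumes sigma(t) = t).
            x = x0 + timesteps[i + 1] * torch.randn_like(x0)
        else:
            x = x0
    return x
```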

Moreover, the inclusion of the GAN for fine-detail refinement, together with empirical validation through visualization, underscores the benefits of the dual-expert configuration. The results indicate a substantial performance improvement, particularly in synthesizing high-resolution, motion-consistent video content.

Implications and Future Developments

The Dual-Expert Consistency Model contributes significantly to video generation by presenting a viable method for overcoming the intrinsic limitations of diffusion models in video synthesis. The dual specialization of learning tasks within the model not only enhances visual quality but also provides a framework that can be extended to other domains demanding high fidelity and temporal consistency.

Practically, DCM sets a new benchmark for video synthesis in terms of both efficiency and output quality. The model’s foundation in consistency distillation suggests promising extensions in other generative domains, such as high-resolution image and 3D model synthesis. The paper’s methodology could spearhead future innovations focused on parameter-efficient and task-specific expert models.

Overall, the Dual-Expert Consistency Model exemplifies a crucial step forward in addressing the computational and qualitative challenges present in the contemporary landscape of video synthesis, highlighting its substantial contributions to the ongoing advancement of artificial intelligence in multimedia applications.
