Overview of QVGen: Low-bit Quantized Video Diffusion Models
The paper introduces QVGen, a quantization-aware training (QAT) framework that enables video diffusion models (DMs) to run efficiently under extremely low-bit quantization, specifically 4-bit or lower. The primary challenge it addresses is the substantial computational and memory cost of video DMs, which hinders their deployment in real-world scenarios. The problem is particularly acute for models like Wan 14B, which demands over 30 minutes and 50 GB of GPU memory to generate a 10-second video at 720p resolution, even on a high-end H100 GPU.
Innovations in Quantization for Video DMs
The paper draws a clear distinction from prior techniques that address quantization for image DMs, showing that these methods fall short when applied directly to video DMs. QVGen introduces a novel QAT paradigm that improves convergence by minimizing the gradient norm during training. To this end, it employs auxiliary modules specifically tailored to mitigate quantization error, which in turn stabilize training, a crucial design choice given the severe performance degradation observed in existing solutions.
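As a rough, hedged illustration of why a smaller gradient norm aids convergence (this is not the paper's exact derivation): if the loss $\mathcal{L}$ is $L$-smooth and the auxiliary module acts as a weight-space correction $\Delta_\Phi$ to the quantized weights $\hat{W} = Q(W)$, then

$$\|\nabla \mathcal{L}(\hat{W} + \Delta_\Phi)\| \;\le\; \|\nabla \mathcal{L}(W)\| + L\,\|W - \hat{W} - \Delta_\Phi\|,$$

so the more closely $\Delta_\Phi$ cancels the quantization error $W - \hat{W}$, the smaller the gradient-norm term that governs convergence.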
Key to the QVGen framework are the auxiliary modules, denoted Φ, which narrow the quantization error gap by supplementing the quantized model during training. This markedly improves convergence, enabling robust training of video DMs even at 4-bit quantization while achieving quality comparable to full-precision models.
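A minimal PyTorch sketch of this idea, assuming Φ is parameterized as a trainable low-rank product (consistent with the SVD-based rank decay described in the next section). Names such as `QuantLinearWithAux` and `fake_quantize` are illustrative, not the paper's API:

```python
import torch
import torch.nn as nn

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # Forward uses the quantized weights; backward passes gradients to w unchanged.
    return w + (w_q - w).detach()

class QuantLinearWithAux(nn.Module):
    """Hypothetical quantized linear layer with a low-rank auxiliary branch Phi.

    During QAT, Phi(x) = (x @ A) @ B helps absorb the quantization error
    W x - Q(W) x; rank decay later shrinks Phi toward rank 0 so that
    inference runs on the quantized weights alone.
    """
    def __init__(self, in_features: int, out_features: int,
                 rank: int = 16, num_bits: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.num_bits = num_bits
        # Low-rank factors of the auxiliary module Phi; B starts at zero so
        # training begins from the plain quantized model.
        self.A = nn.Parameter(torch.randn(in_features, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(rank, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = fake_quantize(self.weight, self.num_bits)
        main = x @ w_q.t()            # quantized path, kept at inference
        aux = (x @ self.A) @ self.B   # auxiliary path, removed by rank decay
        return main + aux
```

Initializing `B` at zero means Φ contributes nothing at the start and only grows where it actually reduces quantization error.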
Eliminating Inference Overhead with Rank-Decay
A striking feature of QVGen is its rank-decay strategy, which systematically eliminates the inference overhead introduced by the auxiliary modules. By employing singular value decomposition (SVD) and a specialized rank-based regularization, the framework identifies non-contributive components of Φ and progressively nullifies them. This strategy ensures that while auxiliary modules remain active during training to aid convergence, they do not persist in the inference phase, thereby reducing overhead to zero without sacrificing model performance.
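A hedged sketch of how the SVD-based shrinking could operate on the low-rank factors of Φ; the paper's actual rank-based regularization and decay schedule are more involved, and `rank_decay_step` is a hypothetical helper, not the authors' implementation:

```python
import torch

@torch.no_grad()
def rank_decay_step(A: torch.Tensor, B: torch.Tensor, keep_rank: int):
    """Shrink the auxiliary module Phi = A @ B to at most `keep_rank` via SVD.

    Components with the smallest singular values are treated as
    non-contributive and zeroed out; the truncated Phi is then refactored
    into smaller low-rank factors.
    """
    phi = A @ B                                 # (in_features, out_features)
    U, S, Vh = torch.linalg.svd(phi, full_matrices=False)
    S[keep_rank:] = 0.0                         # nullify the weakest components
    sqrt_s = S.sqrt()
    A_new = (U * sqrt_s.unsqueeze(0))[:, :keep_rank]   # (in_features, keep_rank)
    B_new = (sqrt_s.unsqueeze(1) * Vh)[:keep_rank, :]  # (keep_rank, out_features)
    return A_new, B_new
```

Applying such a step repeatedly with a decreasing `keep_rank` schedule drives Φ to rank 0, at which point the auxiliary branch can be dropped entirely and inference incurs no overhead.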
Experimental Results
In extensive experiments on state-of-the-art video DMs such as CogVideoX and Wan, QVGen demonstrated superior performance. For instance, 3-bit CogVideoX-2B achieved gains of +25.28 in Dynamic Degree and +8.43 in Scene Consistency on the VBench benchmark, surpassing existing methods. QVGen is also the first method to reach quality comparable to full precision under 4-bit settings, and its effectiveness holds across model scales and bit-width configurations. The paper further notes that Scene Consistency remains challenging across models and methods, while Dynamic Degree recovers more readily.
Implications and Future Directions
The implications of QVGen's contributions are multifaceted. Practically, the framework paves the way for deploying video DMs on resource-constrained, consumer-grade hardware and edge devices, broadening the applicability and accessibility of advanced video synthesis. Theoretically, its approach of mitigating large quantization errors while progressively shrinking the auxiliary modules marks a significant advance in QAT practice.
Looking ahead, the framework's adaptability to other model architectures and tasks, such as image DMs, presents interesting research opportunities. Future studies could optimize kernel implementations to further improve inference speed, or apply QVGen's principles to other domains of artificial intelligence, potentially unlocking efficient deployment of advanced neural networks across diverse sectors.
In conclusion, the paper makes a substantial contribution to quantized training for video DMs, addressing critical challenges with methodological rigor and supporting its claims with strong numerical evidence of high-quality video generation under extremely low bit-widths.