- The paper introduces Sparse VideoGen (SVG), a framework that accelerates video Diffusion Transformers (DiTs) by leveraging spatial-temporal sparsity in their attention mechanisms.
- SVG classifies attention heads into spatial and temporal types, uses training-free online profiling to identify each head's sparse pattern, and achieves up to 2.33x speedup on models such as CogVideoX-v1.5 and HunyuanVideo without compromising visual quality.
- This efficiency enables broader real-world applications for video generation and offers potential integration with techniques like quantization for further computational gains.
Accelerating Video Generation with Spatial-Temporal Sparsity
The paper "Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity" addresses the computational inefficiencies commonly found in Diffusion Transformers (DiTs) used for video generation. Recognized for their exceptional performance in generating high-quality video, DiTs suffer from high computational costs, primarily due to the quadratic complexity associated with 3D full attention mechanisms. This paper presents Sparse VideoGen (SVG), a framework designed to leverage the innate sparsity within 3D attention structures, thus enhancing inference efficiency.
At the core of this framework is the classification of attention heads into two distinct types: the Spatial Head and the Temporal Head. A Spatial Head focuses primarily on tokens within the same frame, preserving spatial structure, while a Temporal Head attends to tokens at the same spatial location across frames, ensuring temporal consistency. Recognizing these patterns allows SVG to compute attention more efficiently by dynamically applying the appropriate sparse attention mask to each head during video generation.
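To make the two patterns concrete, the sketch below builds the corresponding boolean attention masks, assuming a frame-major token layout (tokens concatenated frame by frame, with an equal number of tokens per frame), and applies them with PyTorch's `scaled_dot_product_attention`. It only illustrates the shape of the sparsity; the function names are illustrative, and SVG itself uses optimized sparse kernels rather than dense masks.

```python
import torch
import torch.nn.functional as F

def spatial_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # Block-diagonal mask: each query attends only to tokens in its own frame.
    frame_id = torch.arange(num_frames * tokens_per_frame) // tokens_per_frame
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # Strided mask: each query attends only to the same spatial location in every frame.
    pos_id = torch.arange(num_frames * tokens_per_frame) % tokens_per_frame
    return pos_id[:, None] == pos_id[None, :]

# Toy example: 4 frames x 16 tokens per frame, one head of dimension 64.
frames, tokens, dim = 4, 16, 64
q = k = v = torch.randn(1, 1, frames * tokens, dim)  # [batch, heads, seq, dim]
out_spatial = F.scaled_dot_product_attention(q, k, v, attn_mask=spatial_mask(frames, tokens))
out_temporal = F.scaled_dot_product_attention(q, k, v, attn_mask=temporal_mask(frames, tokens))
```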
SVG introduces a training-free, online profiling strategy that identifies each head's sparse attention pattern at minimal computational cost. A key supporting optimization is an efficient tensor layout transformation that improves hardware utilization, addressing the inefficiency of non-contiguous memory access in GPU attention kernels. By restricting each head's computation to its identified sparse pattern, SVG significantly reduces computational overhead while maintaining high visual fidelity in the generated videos.
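One way to picture the profiling step is as a cheap per-head test: sample a few query rows, compare each candidate sparse pattern's output against dense attention on just those rows, and keep the pattern with the smaller error. The sketch below (a single head, a hypothetical `profile_head` helper, and random query sampling) captures that idea under simplified assumptions; it is not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def profile_head(q, k, v, masks, num_sample_queries=32):
    """Select the sparse pattern whose output best matches dense attention
    on a small sample of query rows, for a single attention head.

    q, k, v: [seq, dim] tensors for one head.
    masks:   dict mapping a pattern name to a [seq, seq] boolean mask.
    """
    seq = q.shape[0]
    idx = torch.randperm(seq)[:num_sample_queries]        # sampled query rows
    qs = q[idx][None, None]                               # [1, 1, s, dim]
    kk, vv = k[None, None], v[None, None]                 # [1, 1, seq, dim]

    dense = F.scaled_dot_product_attention(qs, kk, vv)    # full-attention reference
    best_name, best_err = None, float("inf")
    for name, mask in masks.items():
        sparse = F.scaled_dot_product_attention(qs, kk, vv, attn_mask=mask[idx])
        err = (sparse - dense).abs().mean().item()
        if err < best_err:
            best_name, best_err = name, err
    return best_name
```

The layout transformation complements this selection: for temporal heads, reordering tokens so that entries at the same spatial position across frames sit contiguously in memory turns the strided pattern into dense, hardware-friendly blocks.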
Extensive evaluations demonstrate SVG's efficacy. Notably, SVG delivers up to 2.33x speedup on representative video generation models such as CogVideoX-v1.5 and HunyuanVideo without compromising perceptual quality. The peak signal-to-noise ratio (PSNR) of the generated videos remains consistently high, reaching up to 29.99 dB, indicating that the sparse attention patterns preserve the visual quality of full-attention generation.
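For reference, PSNR here quantifies pixel-level fidelity, presumably between the sparsely generated video and its full-attention counterpart; a minimal sketch of the metric, assuming both videos are tensors of identical shape with values scaled to [0, 1], is:

```python
import torch

def psnr(reference: torch.Tensor, generated: torch.Tensor, max_val: float = 1.0) -> float:
    # PSNR in dB between two videos of identical shape with values in [0, max_val].
    mse = torch.mean((reference - generated) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))
```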
The implications of these findings are substantial. Efficient video generation at scale opens opportunities for real-world applications across domains, from animation to physical-world simulation, constrained primarily by computational resources. SVG's integration with existing generative models without retraining further underscores its practicality and adaptability within current technological ecosystems.
Looking to the future, SVG could inspire further research into adaptive attention mechanisms, exploring how sparse structures might be exploited in contexts beyond video generation. Moreover, given its compatibility with quantization techniques such as FP8, SVG could be paired with emerging low-bit precision architectures to drive further efficiency gains in computationally intensive tasks.
In conclusion, Sparse VideoGen provides a compelling solution to the computational bottlenecks of video diffusion transformers, ensuring that high-quality video synthesis can be achieved more efficiently. This advancement not only paves the way for broader applications of video generative models but also invites further exploration into optimizing attention mechanisms across varied computational contexts.