- The paper introduces semantic-aware permutation to efficiently identify and process critical tokens in Diffusion Transformers for video generation.
- It combines centroid-based attention-score estimation with a customized attention kernel to allocate the computational budget dynamically and balance speed with quality.
- Results show up to a 2.30x speedup on HunyuanVideo with PSNR up to 30 dB, underscoring practical performance gains.
Sparse VideoGen2: Accelerating Video Generation through Sparse Attention and Semantic-Aware Permutation
Introduction
Sparse VideoGen2 (SVG2) is a novel approach designed to enhance the efficiency of video generation in Diffusion Transformers (DiTs) by leveraging sparse attention mechanisms. DiTs generate high-quality videos but suffer from significant latency because attention cost grows quadratically with sequence length, and a video clip unrolls into a very long token sequence. SVG2 addresses this bottleneck by restricting computation to the critical tokens that dominate the attention scores, balancing computational efficiency against generation quality.
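To see why sparsity pays off, consider the sequence lengths involved. The back-of-the-envelope calculation below uses hypothetical latent dimensions (not the paper's exact configuration) to illustrate how dense attention cost explodes with video length, and how keeping only a fraction of tokens cuts it roughly linearly:

```python
# Hypothetical latent-video dimensions, for illustration only.
frames, height, width = 64, 45, 80      # patchified latent grid
seq_len = frames * height * width       # 230,400 tokens
head_dim = 128

# Dense attention per head: QK^T plus the PV product,
# each ~2 * seq_len^2 * head_dim FLOPs.
dense_flops = 4 * seq_len**2 * head_dim
print(f"tokens: {seq_len:,}, dense FLOPs/head: {dense_flops:.2e}")

# Keeping only a fraction s of key/value tokens per query scales cost by ~s.
s = 0.15
print(f"sparse FLOPs/head at s={s}: {dense_flops * s:.2e}")
```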
Technical Approach
SVG2 introduces a semantic-aware permutation that clusters tokens by semantic rather than positional similarity, improving the accuracy of critical token identification. Because tokens within a cluster share similar activations, each cluster's aggregated activation is more representative, which yields better approximations of the attention scores used to decide which tokens are critical. The permutation then reorders tokens so that each cluster, and hence the set of critical tokens, occupies a contiguous block, minimizing computation waste and enhancing GPU efficiency.
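A minimal sketch of the idea follows, assuming a plain k-means over per-head key activations as the clustering step; the name `semantic_permutation` and this exact procedure are our assumptions, not the paper's API:

```python
import torch

def semantic_permutation(keys: torch.Tensor, num_clusters: int, iters: int = 10):
    """Cluster key tokens by activation similarity (plain k-means) and return
    a permutation that lays each cluster out contiguously.

    keys: (seq_len, head_dim) token activations for one attention head.
    """
    seq_len, _ = keys.shape
    # Initialize centroids from randomly chosen tokens.
    centroids = keys[torch.randperm(seq_len)[:num_clusters]].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid: semantic, not positional.
        assign = torch.cdist(keys, centroids).argmin(dim=1)
        # Move each centroid to the mean activation of its cluster.
        for c in range(num_clusters):
            members = keys[assign == c]
            if members.numel() > 0:
                centroids[c] = members.mean(dim=0)
    # Sorting by cluster id makes every cluster a contiguous block, so a
    # block-sparse kernel wastes no work on blocks that mix clusters.
    perm = torch.argsort(assign, stable=True)
    cluster_sizes = torch.bincount(assign, minlength=num_clusters)
    return perm, cluster_sizes, centroids
```

In use, the same permutation would be applied to Q, K, and V before block-sparse attention, and its inverse (`inv[perm] = torch.arange(seq_len)`) applied to the output to restore the original token order.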
To allocate compute where it matters, SVG2 integrates centroid-based estimation: each cluster's centroid stands in for its member tokens when predicting attention scores, and these estimates drive dynamic allocation of the computational budget. The implementation also relies on a customized attention kernel that supports variable block sizes, which is essential because semantic-aware clustering produces clusters of uneven size.
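The sketch below illustrates one way such centroid-based selection could work, approximating cluster-to-cluster attention mass and keeping key clusters up to a top-p budget; the function name and the exact weighting scheme are our assumptions:

```python
import torch

def select_critical_clusters(q_centroids, k_centroids, k_sizes, top_p=0.9):
    """Estimate cluster-to-cluster attention mass from centroids and keep,
    per query cluster, the smallest set of key clusters covering top_p of it.

    q_centroids: (Cq, d); k_centroids: (Ck, d); k_sizes: (Ck,) tokens/cluster.
    Returns a boolean (Cq, Ck) mask of key clusters to attend to.
    """
    d = q_centroids.shape[-1]
    # Centroid dot products stand in for token-level attention logits; weighting
    # by cluster size accounts for the tokens each centroid represents.
    logits = (q_centroids @ k_centroids.T) / d**0.5
    mass = torch.softmax(logits, dim=-1) * k_sizes
    mass = mass / mass.sum(dim=-1, keepdim=True)

    # Top-p budget: keep the highest-mass clusters until the estimated
    # attention mass reaches top_p; the top-1 cluster is always kept.
    sorted_mass, order = mass.sort(dim=-1, descending=True)
    keep_sorted = sorted_mass.cumsum(dim=-1) - sorted_mass < top_p
    mask = torch.zeros_like(keep_sorted).scatter(-1, order, keep_sorted)
    return mask
```

A block-sparse kernel would then compute attention only for the (query cluster, key cluster) pairs the mask keeps, with block shapes matching each cluster's actual size rather than a fixed tile.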
Numerical Results
SVG2 demonstrates significant improvements in both speed and quality. On a single H100 GPU, it achieves up to a 2.30x end-to-end speedup on HunyuanVideo and 1.89x on Wan 2.1, with PSNR of up to 30 dB and 26 dB, respectively, showing that SVG2 maintains high video quality while significantly reducing latency (Figure 1).
Figure 1: SVG2 accelerates video generation while maintaining high quality. On a single H100, SVG2 achieves up to 2.30x (HunyuanVideo) and 1.89x (Wan 2.1) end-to-end speedups, with PSNR up to 30 and 26, respectively.
SVG2 builds on prior research by adopting semantic clustering in place of the common position-based clustering methods, which often yield suboptimal attention-score predictions and wasted computation. Its dynamic-sparsity design aligns with recent advances in adaptive attention mechanisms, selecting and clustering critical tokens at runtime based on the semantics of the data being processed.
Implications and Future Work
The introduction of semantic-aware permutation and centroid-based selection in SVG2 is a substantial step toward efficient video generation with Diffusion Transformers. The practical implications are significant: future frameworks could adopt similar clustering methods to further exploit the natural sparsity of video data. As resolution and video-length demands continue to grow, SVG2 offers a promising avenue for containing computational cost without compromising quality.
Potential future directions include extending semantic-aware clustering to attention mechanisms beyond video generation and applying the same principles to multi-modal generative models.
Conclusion
SVG2 marks a significant advancement in the field of video generation, providing an efficient and effective method for reducing latency without sacrificing output quality. By leveraging semantic-aware permutations and a robust computational framework, SVG2 paves the way for more sophisticated and scalable generative models in AI. Future research could further examine the adaptability of these methods to a broader range of generative tasks and attention-based architectures.