- The paper introduces Sliding Tile Attention (STA), a novel method designed to accelerate video generation within Diffusion Transformers by optimizing the resource-intensive 3D attention mechanism.
- STA improves efficiency through a hardware-aware, tile-based approach that leverages locality and reduces computational redundancy compared to traditional token-wise methods.
- Empirical results demonstrate that STA reduces end-to-end latency by over 1.35× without retraining (and by up to 3.53× with fine-tuning) while accelerating attention computation roughly 10-fold, with minimal impact on video quality metrics.
Fast Video Generation with Sliding Tile Attention: An Overview
The paper introduces Sliding Tile Attention (STA), a novel method designed to make attention mechanisms efficient in video generation, especially within the framework of Diffusion Transformers (DiTs). Video generation with DiTs is computationally intensive due to the high cost of 3D full attention. STA addresses this overhead by capitalizing on the natural redundancy inherent in video data, making high-resolution video generation substantially cheaper.
Diffusion Transformers and the Bottleneck
DiTs employ a 3D attention mechanism that, despite its effectiveness in maintaining spatial and temporal coherence in generated videos, is extremely resource-intensive. Because attention cost grows quadratically with the number of tokens, generating even short videos is expensive in both compute and time. As highlighted in the paper, 3D full attention dominates inference time, even on high-end GPUs.
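To make the quadratic cost concrete, here is a back-of-the-envelope sketch; the latent grid and head dimension below are illustrative assumptions, not figures from the paper.

```python
# Rough cost of 3D full attention on a video latent grid.
# All sizes are illustrative assumptions, not the paper's settings.
T, H, W = 16, 45, 80   # latent frames, height, width (hypothetical)
d = 128                # per-head attention dimension (hypothetical)

n = T * H * W          # one token per latent position
# QK^T and the attention-weighted sum over V each take ~2 * n^2 * d FLOPs,
# so doubling the sequence length quadruples the attention work.
flops_per_head = 4 * n**2 * d

print(f"tokens: {n:,}")                                   # 57,600
print(f"attention FLOPs per head: {flops_per_head:.3e}")  # ~1.7e12
```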
Sliding Tile Attention: A Hardware-Efficient Approach
STA was developed to address the inefficiencies of existing attention mechanisms with a tile-based design that reduces computation while making full use of GPU capabilities. It replaces token-wise sliding window attention with a tile-by-tile mechanism that minimizes computational redundancy and increases hardware utilization. The technique restricts attention to local spatial-temporal windows, motivated by the observation that attention in video data is overwhelmingly local.
Key Features of STA:
- Locality-based Sparsity: Acknowledges the inherent redundancy in video data by localizing the attention process to spatially and temporally proximate areas.
- Reduced Computational Redundancy: By sliding the window tile by tile rather than token by token, STA avoids redundant work on partially masked attention blocks and achieves considerable speed-ups (a simplified sketch follows this list).
- Hardware Optimization: Takes advantage of a hardware-aware design with optimizations at the kernel level to further enhance speed and memory efficiency. The utilization of efficient memory access patterns allows STA to leverage GPU architecture effectively.
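As a rough illustration of the tile-level idea, the sketch below builds a block mask along a single axis. It is a minimal sketch, not the paper's kernel: STA actually slides a 3D window over (time, height, width) tiles inside a fused attention kernel, and the tile counts here are hypothetical.

```python
import torch

def sliding_tile_block_mask(num_tiles: int, window_tiles: int) -> torch.Tensor:
    """Boolean mask at tile granularity: query tile i attends to key tiles j
    with |i - j| <= window_tiles. Every admitted entry corresponds to a fully
    dense tile-pair, so a block-wise kernel either computes a whole block or
    skips it, avoiding the partially masked blocks that token-wise sliding
    windows produce."""
    idx = torch.arange(num_tiles)
    return (idx[:, None] - idx[None, :]).abs() <= window_tiles

# Hypothetical sizes: 12 tiles along one axis, each attending +/-2 tiles.
mask = sliding_tile_block_mask(num_tiles=12, window_tiles=2)
print(mask.int())
# At most 5 of 12 tile-blocks per row are computed; the rest are skipped
# outright, which is where the attention-FLOP savings come from.
```

In the full 3D case the same test is applied along each of the time, height, and width axes and the results intersected, so each query tile attends to a dense local cube of key tiles.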
Performance and Results
Empirical results demonstrate that STA significantly improves the efficiency of video generation models without compromising output quality. On models like HunyuanVideo, STA reduces the end-to-end latency by more than 1.35× compared to established methods, with quality remaining virtually unchanged. Furthermore, STA's efficiency becomes more pronounced with fine-tuning, achieving up to a 3.53× speedup.
Quantitative Results:
- STA accelerates attention computation by roughly 10× and cuts end-to-end generation latency from 945 seconds with standard FlashAttention to approximately 268 seconds (a quick consistency check on these numbers follows this list).
- Applied to video models, STA substantially lowers inference time with minimal quality loss (a 0.09% drop) on benchmarks such as VBench.
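The two latency figures above imply the same speedup quoted in the Performance section; a trivial consistency check using only the numbers from this summary:

```python
# End-to-end latency figures quoted above (seconds).
baseline_s, sta_s = 945, 268
print(f"speedup: {baseline_s / sta_s:.2f}x")  # ~3.53x, matching the
# fine-tuned speedup reported in the Performance section.
```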
Theoretical and Practical Implications
From a theoretical perspective, STA provides an elegant solution to the enduring challenge of high computational costs associated with 3D attention. It offers a pathway towards making high-resolution video generation more accessible and less resource-intensive. Practically, the reduced computational cost can lead to wider applicability in real-time video applications, potentially transforming fields that rely heavily on synthesized videos, such as virtual reality and augmented reality environments.
Future Trajectories
While the research successfully demonstrates the effectiveness of STA, it is crucial to explore its integration with other architectural innovations in model efficiency and scaling. Potential future developments could focus on synergizing STA with other model optimization techniques, such as reduced-precision arithmetic, or extending its applicability to other resource-constrained scenarios like mobile devices.
In conclusion, Sliding Tile Attention charts a promising path toward more efficient and scalable high-quality video generation. It stands as a testament to the potential of aligning algorithmic innovation with hardware characteristics to achieve substantial practical advances in machine learning applications.