DiTFastAttn: Attention Compression for Diffusion Transformer Models (2406.08552v2)

Published 12 Jun 2024 in cs.CV

Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.

Authors (9)
  1. Zhihang Yuan (45 papers)
  2. Pu Lu (5 papers)
  3. Hanling Zhang (11 papers)
  4. Xuefei Ning (52 papers)
  5. Linfeng Zhang (160 papers)
  6. Tianchen Zhao (27 papers)
  7. Shengen Yan (26 papers)
  8. Guohao Dai (51 papers)
  9. Yu Wang (940 papers)
Citations (5)

Summary

  • The paper introduces DiTFastAttn, a novel post-training method that reduces redundant attention computations in diffusion transformer models.
  • It employs Window Attention with Residual Sharing (WA-RS), Attention Sharing across Timesteps (AST), and Attention Sharing across CFG (ASC) to address spatial, temporal, and conditional redundancies without retraining.
  • Experimental results show up to a 76% reduction in attention FLOPs and up to a 1.8x end-to-end speedup at high-resolution (2K x 2K) image generation with little quality loss.

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Introduction

The paper "DiTFastAttn: Attention Compression for Diffusion Transformer Models" presents a novel approach aimed at addressing the computational inefficiencies inherent in diffusion transformers (DiTs), particularly focusing on the quadratic complexity of the self-attention mechanism. While DiTs excel in image and video generation tasks, their practical application is often hampered by substantial computational demands, especially at higher resolutions. This work introduces DiTFastAttn, a post-training model compression method, designed to mitigate these inefficiencies without requiring extensive retraining.

Key Contributions

The paper identifies three main redundancies in the attention computation of DiTs during the inference process:

  1. Spatial Redundancy: Many attention heads predominantly capture local spatial information. Consequently, attention values for distant tokens tend toward zero.
  2. Temporal Redundancy: Attention outputs across neighboring timesteps exhibit high similarity.
  3. Conditional Redundancy: Conditional and unconditional inferences in classifier-free guidance produce attention outputs with significant overlap (a simple way to quantify the latter two redundancies is sketched after this list).
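
To make the latter two redundancies concrete, the snippet below computes the cosine similarity between attention outputs of neighboring timesteps and between the conditional and unconditional branches of a CFG step. The metric and the toy tensors are illustrative assumptions for this sketch, not the paper's exact redundancy criterion.

```python
# Illustrative check of temporal and conditional redundancy.
# Cosine similarity of attention outputs is an assumed metric; the toy
# tensors merely stand in for real DiT activations.
import torch
import torch.nn.functional as F

def output_similarity(out_a: torch.Tensor, out_b: torch.Tensor) -> float:
    """Mean per-token cosine similarity between two attention outputs of shape (tokens, dim)."""
    return F.cosine_similarity(out_a, out_b, dim=-1).mean().item()

# Toy stand-ins: outputs at timestep t and t-1, and the two CFG branches.
out_t      = torch.randn(1024, 64)
out_t_prev = out_t + 0.05 * torch.randn_like(out_t)        # neighboring step: small change
out_cond   = torch.randn(1024, 64)
out_uncond = out_cond + 0.10 * torch.randn_like(out_cond)  # CFG branches: similar

print("temporal similarity:   ", output_similarity(out_t, out_t_prev))
print("conditional similarity:", output_similarity(out_cond, out_uncond))
```

Similarity values close to 1 for either pair indicate that the corresponding attention computation is a candidate for sharing or reuse.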

To tackle these redundancies, the authors propose three corresponding techniques; a combined code sketch follows the list:

  1. Window Attention with Residual Sharing (WA-RS): This method reduces spatial redundancy by employing window-based attention in certain layers while preserving long-range dependencies through a cached residual between the full and window attention outputs.
  2. Attention Sharing across Timesteps (AST): This technique exploits the similarity between neighboring timesteps, reusing cached attention outputs to accelerate subsequent computations.
  3. Attention Sharing across CFG (ASC): By reusing attention outputs from conditional inference during unconditional inference in classifier-free guidance (CFG), this approach eliminates redundant computations.
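
The sketch below shows how these three strategies might be wired into a single attention layer. The class, the strategy labels, and the naive window mask are hypothetical stand-ins chosen for readability; this is a minimal sketch of the idea under those assumptions, not the paper's implementation or its kernels.

```python
# Minimal sketch of WA-RS, AST, and ASC in one attention layer.
# All names (FastAttnLayer, strategy labels, cfg_attention) are hypothetical;
# a real implementation would use fused kernels rather than explicit masks.
import torch

class FastAttnLayer:
    def __init__(self, window: int = 128):
        self.window = window
        self.cached_output = None    # previous step's output, reused by AST
        self.cached_residual = None  # (full - window) output, reused by WA-RS

    def full_attention(self, q, k, v):
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return torch.softmax(scores, dim=-1) @ v

    def window_attention(self, q, k, v):
        # Naive local attention: mask out tokens farther than `window` positions away.
        n = q.shape[-2]
        idx = torch.arange(n)
        far = (idx[None, :] - idx[:, None]).abs() > self.window
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(far, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    def forward(self, q, k, v, strategy: str = "full"):
        if strategy == "AST" and self.cached_output is not None:
            # Attention Sharing across Timesteps: reuse the previous step's output.
            return self.cached_output
        if strategy == "WA-RS":
            out = self.window_attention(q, k, v)
            if self.cached_residual is None:
                # On a designated "full" step, cache the long-range residual.
                full = self.full_attention(q, k, v)
                self.cached_residual = full - out
                out = full
            else:
                # On later steps, add the cached residual to the cheap window output.
                out = out + self.cached_residual
        else:
            out = self.full_attention(q, k, v)
        self.cached_output = out
        return out

def cfg_attention(layer: FastAttnLayer, q, k, v, share_cfg: bool, strategy: str = "full"):
    # Attention Sharing across CFG: the batch holds [conditional, unconditional];
    # when sharing, compute only the conditional half and duplicate its output.
    if share_cfg:
        half = q.shape[0] // 2
        cond = layer.forward(q[:half], k[:half], v[:half], strategy)
        return torch.cat([cond, cond], dim=0)
    return layer.forward(q, k, v, strategy)
```

In practice, which technique applies at each layer and timestep is chosen per model; the strategy and share_cfg arguments above are only placeholders for that decision.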

Experimental Evaluation

The authors conducted extensive evaluations on multiple diffusion transformer models: DiT-XL-2 (512x512), PixArt-Sigma (1024 and 2K), and OpenSora. Key performance metrics include FID, IS, and CLIP score for image generation quality, alongside computational efficiency measured in attention FLOPs and end-to-end latency.
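
To give a rough sense of where the FLOP savings come from, the following back-of-the-envelope comparison contrasts full and window attention at 2K resolution. The patch size, head dimension, and window size are assumptions chosen only to show the scaling, not the evaluated models' actual configurations.

```python
# Rough attention-cost arithmetic at 2K resolution (assumed configuration).
def attention_flops(q_tokens: int, kv_tokens: int, head_dim: int) -> float:
    """Approximate FLOPs of one attention head: the QK^T and AV matmuls (softmax ignored)."""
    return 2 * 2 * q_tokens * kv_tokens * head_dim  # two matmuls, 2 FLOPs per multiply-add

tokens = (2048 // 16) ** 2                # 2048x2048 image, assumed 16x16 patches -> 16384 tokens
full = attention_flops(tokens, tokens, head_dim=64)
windowed = attention_flops(tokens, 1024, head_dim=64)  # assumed local window of 1024 tokens

print(f"full attention:   {full / 1e9:.1f} GFLOPs per head per layer")
print(f"window attention: {windowed / 1e9:.1f} GFLOPs per head per layer "
      f"({windowed / full:.1%} of full)")
```

Because full-attention cost grows quadratically with token count while the windowed cost grows linearly, the savings are most pronounced exactly at the high resolutions reported in the paper.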

Results:

  • For image generation tasks, DiTFastAttn delivered significant reductions in attention computation with minimal loss in generative quality. Notably, at 2K x 2K generation with PixArt-Sigma, it reduced attention FLOPs by up to 76% and achieved up to a 1.8x end-to-end speedup.
  • In video generation tasks using OpenSora, DiTFastAttn effectively reduced attention computation while maintaining visual quality, though aggressive compression configurations exhibited slight quality degradations.

Implications

Practical Implications: The ability to compress attention computations without retraining makes DiTFastAttn especially valuable for deploying DiTs in resource-constrained environments. This is critical for applications requiring real-time generation and for deployment on edge devices with limited computational power.

Theoretical Implications: The paper contributes to a deeper understanding of redundancies in transformer models, potentially guiding future research focused on efficient architecture designs and further compression techniques for attention mechanisms.

Future Directions

The success of DiTFastAttn opens several avenues for further exploration:

  • Training-aware Compression Methods: Extending the current post-training approach to incorporate training-aware techniques could mitigate the performance drop observed in more aggressive compression settings.
  • Exploring Beyond Attention: While DiTFastAttn focuses on attention mechanisms, other components of the transformer architecture may also present opportunities for similar computational optimizations.
  • Kernel-level Optimizations: Enhancing the underlying kernel implementations could provide additional speedups, further improving the practicality of the approach.

Conclusion

DiTFastAttn presents a robust solution to the computational challenges faced by diffusion transformers in high-resolution image and video generation. By identifying and addressing specific redundancies within the attention mechanism, the proposed methods achieve substantial reductions in computation while maintaining output quality. These advancements hold promise for broader applications of DiTs, paving the way for more efficient and accessible generative models.