
Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers (2505.14687v1)

Published 20 May 2025 in cs.CV

Abstract: Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment, for example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In this work, we propose GRAT (\textbf{GR}ouping first, \textbf{AT}tending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity in learned attention maps (which tend to be locally focused) in pretrained Diffusion Transformers and leverage better GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a \textbf{35.8$\times$} speedup over full attention when generating $8192\times 8192$ images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.

Summary

Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

The paper introduces GRAT (Grouping First, Attending Smartly), a novel training-free attention acceleration strategy designed to enhance the efficiency of Diffusion Transformers in generating high-resolution images and videos. This approach tackles the computational bottleneck of self-attention, which traditionally scales quadratically with sequence length, posing significant practical deployment challenges due to increased inference latency and resource consumption.

Methodology

GRAT's framework consists of two key components: Grouping First and Attending Smartly. The Grouping First phase partitions input tokens into non-overlapping contiguous groups that align both with GPU execution patterns and with the local attention structures learned by pretrained Diffusion Transformers. The Attending Smartly phase then has all query tokens within the same group share a single restricted set of attendable key and value tokens, confined to structured regions such as surrounding blocks or criss-cross patterns. This preserves essential attention patterns and long-range context while significantly reducing computational overhead.
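The grouping-plus-shared-KV idea can be sketched in a few lines. Below is a minimal 1-D NumPy illustration, not the paper's implementation: the function and parameter names (`grat_attention_1d`, `group_size`, `neighbor_groups`) are invented for this sketch, and the "surrounding blocks" region is reduced to a 1-D window of neighboring groups rather than the paper's 2-D/3-D structured regions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grat_attention_1d(q, k, v, group_size=4, neighbor_groups=1):
    """GRAT-style attention sketch (1-D toy version).

    Queries are partitioned into contiguous non-overlapping groups;
    every query in a group shares one common KV set, restricted to the
    group's own tokens plus `neighbor_groups` groups on each side.
    """
    n, d = q.shape
    assert n % group_size == 0, "sequence length must divide into groups"
    num_groups = n // group_size
    out = np.empty_like(v)
    for g in range(num_groups):
        q_slice = slice(g * group_size, (g + 1) * group_size)
        # Shared attendable KV region for the whole group: a window of
        # surrounding groups, clipped to the sequence boundaries.
        lo = max(0, (g - neighbor_groups) * group_size)
        hi = min(n, (g + 1 + neighbor_groups) * group_size)
        scores = q[q_slice] @ k[lo:hi].T / np.sqrt(d)
        out[q_slice] = softmax(scores) @ v[lo:hi]
    return out
```

Because every query in a group attends to the same contiguous KV block, each group's attention is one dense matmul over a short KV span, which is what makes the pattern friendly to GPU execution; when the window covers the whole sequence, the sketch reduces exactly to full attention.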

Performance Evaluation

The paper evaluates GRAT's effectiveness on the pretrained Flux and HunyuanVideo models for image and video generation, respectively. GRAT achieved substantial real-world speedups—up to 35.8 times faster than full attention for image generation—while maintaining visual quality comparable to full self-attention, as demonstrated by its competitive GenEval scores. Similarly, GRAT accelerated video generation inference by 15.8 times over full attention, outperforming the more intricate STA mechanism in both speed and sample quality.

Implications and Future Research

GRAT presents a promising step towards efficient, scalable deployment of Diffusion Transformers in computationally constrained environments and latency-sensitive applications. The work encourages future exploration of adaptive grouping strategies and dynamic attention mechanisms to further enhance adaptability across diverse tasks and input types. Additionally, the substantial inference speedups offered by GRAT without compromising sample quality suggest practical applications in real-time processing and creative industries where high-quality visual generation is required.

Furthermore, GRAT's methodology highlights the potential of leveraging structured sparsity within attention mechanisms to minimize computational load, opening avenues for more sustainable AI practices by reducing the energy consumption associated with large-scale model inference. As such, the paper not only advances immediate capabilities in visual generation but also contributes to broader efforts toward environmentally conscious AI development.
