Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task
The paper "Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task" addresses the computational challenges posed by global self-attention mechanisms in diffusion transformers, specifically when applied to high-resolution image and video generation tasks.
Introduction and Background
Recent advancements in diffusion models such as Sora, Stable Diffusion 3, and Flux have shown that transformer-based architectures can achieve exceptional performance in creating photorealistic images and videos. However, the quadratic computational complexity of global self-attention concerning the sequence length presents significant challenges, particularly when scaling up for higher resolution and video tasks.
Methodology
The authors introduce the Proxy Token Diffusion Transformer (PT-DiT), which leverages sparse representative-token attention to efficiently model global visual information, reducing computational complexity while maintaining competitive performance. PT-DiT employs a proxy token mechanism in which one token is sampled from each spatio-temporal window to act as that window's representative in global self-attention. This contrasts with standard global self-attention, whose cost grows quadratically with sequence length.
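The sampling step can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' implementation: the window size `(2, 4, 4)`, the choice of the first token in each window as its proxy, and the tensor layout are all assumptions made for clarity.

```python
import numpy as np

def sample_proxy_tokens(latents, window=(2, 4, 4)):
    """Pick one representative (proxy) token per spatio-temporal window.

    latents: array of shape (F, H, W, C) -- frames, height, width, channels.
    window:  (pf, ph, pw) window sizes; one proxy is taken per window.
    Returns proxies of shape (F//pf * H//ph * W//pw, C).
    """
    F, H, W, C = latents.shape
    pf, ph, pw = window
    # Group tokens into non-overlapping spatio-temporal windows...
    x = latents.reshape(F // pf, pf, H // ph, ph, W // pw, pw, C)
    # ...and take the first token of each window as its proxy
    # (an illustrative choice; any fixed position per window works).
    return x[:, 0, :, 0, :, 0, :].reshape(-1, C)

# A 16x32x32 latent grid shrinks to 8*8*8 = 512 proxy tokens.
lat = np.random.randn(16, 32, 32, 64)
print(sample_proxy_tokens(lat).shape)  # (512, 64)
```

Because attention then operates on 512 proxies instead of 16,384 latent tokens, the quadratic term in the attention cost shrinks dramatically.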
Global Information Interaction Module (GIIM)
The GIIM is a crucial part of PT-DiT. Latent tokens are sampled to form a set of proxy tokens, each representing a localized spatial-temporal region. These proxy tokens undergo self-attention to capture global semantics, which is then infused back into all latent tokens via cross-attention. This selective attention significantly reduces the redundant computation inherent in global self-attention.
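The two attention passes described above can be sketched as follows. This is a bare-bones illustration under simplifying assumptions: learned query/key/value projections, multi-head splitting, and normalization layers present in a real diffusion transformer block are omitted, and the function name `giim` is ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (single head, no projections)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

def giim(tokens, proxies):
    """tokens: (N, C) all latent tokens; proxies: (P, C) with P << N.

    1) Self-attention among proxies captures global semantics at O(P^2).
    2) Cross-attention from every token to the proxies broadcasts that
       global context back at O(N*P) -- far cheaper than O(N^2).
    """
    proxies = proxies + attention(proxies, proxies, proxies)
    return tokens + attention(tokens, proxies, proxies)

tokens = np.random.randn(4096, 64)
proxies = np.random.randn(128, 64)
print(giim(tokens, proxies).shape)  # (4096, 64)
```

The residual additions mirror the standard transformer pattern, so the module can drop into a block without changing token shapes.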
Texture Complement Module (TCM)
To compensate for potential detail loss due to sparse attention, the TCM is introduced, incorporating window and shift-window attention mechanisms akin to the Swin Transformer. These localized attention mechanisms enhance the model's ability to capture fine-grained details, thereby improving texture and overall visual quality.
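The window and shift-window partitioning behind the TCM can be sketched for a 2-D token grid. This is a schematic of the Swin-style mechanism the paper cites, not the authors' code; the window size, shift amount, and cyclic-roll implementation are illustrative assumptions.

```python
import numpy as np

def window_partition(x, ws, shift=0):
    """Split an (H, W, C) token grid into (num_windows, ws*ws, C) windows.

    With shift > 0 the grid is cyclically rolled first, so alternating
    layers see offset windows (Swin-style) and neighboring windows can
    exchange fine-grained detail across layers.
    """
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

grid = np.random.randn(8, 8, 16)
print(window_partition(grid, ws=4).shape)           # (4, 16, 16)
print(window_partition(grid, ws=4, shift=2).shape)  # (4, 16, 16)
```

Attention is then computed independently inside each window, so the cost stays linear in the number of windows rather than quadratic in the full sequence length.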
Computational Complexity Analysis
The authors provide a thorough complexity analysis, showing that PT-DiT achieves substantial reductions in computational overhead. For instance, at resolutions of 512 and 2048, PT-DiT reduces the computational complexity by 82.0% and 82.5% respectively compared to DiT. This sparseness in token interactions also yields significant savings in video generation, where sequences are much longer.
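The source of the savings can be seen with back-of-the-envelope arithmetic. The token and window counts below are illustrative assumptions, and the model only counts the attention-score terms, so the ratio is not meant to reproduce the paper's 82.0%/82.5% figures, which depend on the full model configuration (including the local window attention in the TCM).

```python
def global_attention_cost(n):
    """Pairwise attention over n tokens: n^2 score entries."""
    return n * n

def proxy_attention_cost(n, window_tokens):
    """Proxy scheme: self-attention over p = n // window_tokens proxies,
    plus cross-attention from all n tokens to those p proxies."""
    p = n // window_tokens
    return p * p + n * p

n = 64 * 64   # tokens in a hypothetical latent grid (illustrative)
w = 16        # tokens per window, i.e. one proxy per 16 tokens (illustrative)
ratio = proxy_attention_cost(n, w) / global_attention_cost(n)
print(f"proxy attention costs {ratio:.1%} of global attention")  # 6.6%
```

The key point is structural: global attention scales as O(N²), while the proxy scheme scales as O(P² + N·P) with P = N / window size, so the gap widens as resolution (and hence N) grows.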
Empirical Results
The experimental results highlight PT-DiT's efficacy:
- In the image generation task, PT-DiT's computational complexity is 48.6% less than SiT's at a resolution of 2048.
- For video generation, PT-DiT exhibits superior computational efficiency compared to both EasyAnimate and CogVideoX, while also scaling up to higher resolutions without prohibitive memory demands.
PT-DiT's efficiency is further evidenced by its ability to train on high-resolution images (2048×2048) and videos (512×512×288) using a 64GB Ascend 910B, demonstrating both its efficiency and scalability.
Conclusion
The proposed PT-DiT method not only marks a significant step in addressing computational inefficiencies in diffusion transformers but also sets a benchmark for efficiency-focused design of text-to-any-task transformers. The Qihoo-T2X series built upon PT-DiT, encompassing T2I, T2V, and T2MV models, demonstrates the versatility and practicality of this approach across generation tasks.
Implications and Future Work
The theoretical and practical implications of this research are profound. By addressing the redundancy inherent in global self-attention and proposing a more efficient way to model global semantics, PT-DiT opens new avenues for scaling up diffusion transformers to handle even more complex and higher-resolution tasks. Future developments could explore further optimizations in token sampling strategies and extending this approach to other domains within AI.
In conclusion, the PT-DiT and the Qihoo-T2X series present a promising direction in achieving computationally efficient diffusion transformers, making the generation of high-resolution images and videos more feasible and practical for a wider range of applications.