Accelerating Vision Diffusion Transformers with Skip Branches
The paper "Accelerating Vision Diffusion Transformers with Skip Branches" introduces a novel approach to enhance the efficiency of Diffusion Transformers (DiT) in the field of image and video generation. The authors propose a method called Skip-DiT, which incorporates skip branches to smoothen the dynamic features of DiT and subsequently leverages these branches through a caching mechanism known as Skip-Cache. This approach aims to mitigate the computational intensity traditionally associated with DiT models, particularly concerning the sequential denoising step, thereby facilitating faster inference while maintaining output quality.
Key Contributions
The paper makes several noteworthy contributions to the field of visual generation models:
- Feature Smoothness Identification: The paper identifies feature smoothness as a critical determinant of how well caching mechanisms work within DiT architectures, and proposes skip branches as a way to enhance feature consistency across timesteps, enabling more efficient inference-time caching.
- Skip-DiT Architecture: Skip branches carry high-level feature information across the network, preventing the large cross-timestep feature variance seen in vanilla DiT. The resulting smoother features are more amenable to caching, improving processing speed without compromising quality (see the architecture sketch after this list).
- Skip-Cache Implementation: Leveraging the skip branches, Skip-Cache reuses cached features across timesteps, cutting computational redundancy and accelerating inference by up to 2.2× in specific scenarios (see the caching sketch after this list).
- Empirical Evaluations: Extensive experiments across multiple DiT backbones for image and video generation show that Skip-DiT outperforms existing approaches: it delivers substantial speedup while maintaining, and in some settings improving, generation quality as measured by FID, FVD, VBench, PSNR, LPIPS, and SSIM.
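To make the skip-branch idea concrete, here is a minimal PyTorch sketch of such an architecture. It is an illustration under stated assumptions, not the authors' implementation: the class names (`Block`, `SkipDiT`), the U-ViT-style concatenate-and-project fusion, and the exact pairing of shallow and deep blocks are assumptions, and the timestep and text/class conditioning a real DiT needs is omitted for brevity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a standard DiT transformer block (self-attention + MLP)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h
        return x + self.mlp(self.norm2(x))

class SkipDiT(nn.Module):
    """DiT stack split into shallow and deep halves, with a long skip branch
    from each shallow block to its mirrored deep block (concat + project)."""
    def __init__(self, dim: int = 384, depth: int = 12):
        super().__init__()
        assert depth % 2 == 0
        half = depth // 2
        self.shallow = nn.ModuleList(Block(dim) for _ in range(half))
        self.deep = nn.ModuleList(Block(dim) for _ in range(half))
        self.fuse = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(half))

    def forward(self, x):
        skips = []
        for blk in self.shallow:
            x = blk(x)
            skips.append(x)  # stash shallow features for the skip branches
        for blk, fuse in zip(self.deep, self.fuse):
            # Mirrored order: the last deep block gets the first block's skip.
            x = blk(fuse(torch.cat([x, skips.pop()], dim=-1)))
        return x

# Usage: model = SkipDiT(); out = model(torch.randn(2, 16, 384))
```

The design point is that each deep block sees a direct copy of a shallow feature, which keeps features at adjacent denoising timesteps close to one another and therefore cache-friendly.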
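Building on the sketch above, a Skip-Cache-style inference loop might look like the following. Again a hedged sketch: the function name `skip_cache_sample`, the fixed `cache_interval` policy, and the choice to recompute only the first and last blocks on cached steps are illustrative assumptions, and the diffusion scheduler update is stubbed out.

```python
import torch

@torch.no_grad()
def skip_cache_sample(model: SkipDiT, x: torch.Tensor,
                      num_steps: int = 50, cache_interval: int = 2):
    """Denoising loop that recomputes the deep half of the network only every
    `cache_interval` steps and otherwise reuses its cached output."""
    cached_deep = None
    for step in range(num_steps):
        if step % cache_interval == 0 or cached_deep is None:
            # Full pass, mirroring SkipDiT.forward but keeping the feature
            # that enters the final fusion layer.
            skips, h = [], x
            for blk in model.shallow:
                h = blk(h)
                skips.append(h)
            for blk, fuse in zip(model.deep[:-1], model.fuse[:-1]):
                h = blk(fuse(torch.cat([h, skips.pop()], dim=-1)))
            cached_deep = h  # deep-trunk output, reused on cheap steps below
            h = model.deep[-1](
                model.fuse[-1](torch.cat([h, skips.pop()], dim=-1))
            )
        else:
            # Cheap pass: only the first and last blocks execute; the skip
            # branch splices fresh shallow features into the stale deep trunk.
            skip = model.shallow[0](x)
            h = model.deep[-1](
                model.fuse[-1](torch.cat([cached_deep, skip], dim=-1))
            )
        x = h  # stand-in for the real scheduler update x_{t-1} = step(x_t, h)
    return x
```

On cached steps only two transformer blocks plus one fusion layer run, which is where a speedup on the order of the reported 2.2× would plausibly come from.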
Experimental Analysis
The experiments confirm the viability of Skip-DiT and Skip-Cache across multiple scenarios, with substantial improvements over baseline models and existing caching methods. Across class-to-video, text-to-video, and text-to-image generation, Skip-DiT consistently outperforms traditional models while markedly reducing latency:
- Class-to-video Tasks: On datasets such as UCF101 and Taichi, Skip-DiT achieves notable reductions in FVD. Even with extended caching steps, it remains competitive on FID, demonstrating robustness across settings.
- Text-to-video and Text-to-image Tasks: On tasks with complex textual conditions, Skip-DiT performs strongly, with only minor deviations in perceptual similarity metrics, indicating that skip branches extend DiT's applicability to more sophisticated scenarios.
Implications and Future Work
The introduction of skip branches in vision diffusion transformers opens promising avenues for future research on making visual generation models more computationally efficient and scalable. Such advances matter most for applications where real-time processing is crucial, such as interactive media and virtual environments.
Future work could explore integrating skip branches with architectures beyond DiT, extending to other generative models that face similar computational bottlenecks. Refining the training process to further capitalize on feature smoothness could also improve performance and usability across application domains.
In conclusion, this paper contributes a meaningful advancement in the optimization of diffusion models, positing skip branches as a pivotal element for accelerating inference without sacrificing quality, thereby broadening the scope of efficient visual data generation.