Dynamic Diffusion Transformer: Enhancing Efficiency in Image Generation
This paper introduces the Dynamic Diffusion Transformer (DyDiT), a novel architecture designed to reduce computational redundancy in image generation. DyDiT challenges the static inference paradigm of the Diffusion Transformer (DiT) by adapting computation along both the timestep and spatial dimensions of the generation process. Because a fixed DiT architecture applies the same amount of computation at every timestep and every spatial location, much of its cost is spent on unnecessary or redundant calculations, even though the difficulty of denoising varies across diffusion timesteps and across spatial regions.
Key Innovations and Technical Details
DyDiT implements two primary mechanisms, Timestep-wise Dynamic Width (TDW) and Spatial-wise Dynamic Token (SDT), both of which dynamically allocate computational resources according to the difficulty of the prediction task:
- Timestep-wise Dynamic Width (TDW): TDW adjusts the width of the attention and MLP blocks according to the current diffusion timestep. It is based on the observation that at timesteps near the start of generation, where samples are still close to the prior (noise) distribution, the prediction task is easier; a narrower, less expensive model therefore suffices at those timesteps, avoiding unnecessary computation (see the first sketch after this list).
- Spatial-wise Dynamic Token (SDT): SDT reduces redundant computation by letting image patches with simpler prediction tasks bypass expensive computational blocks. Regions that are easy to predict, such as uniform or background areas with low noise, take a cheaper computational path (see the second sketch after this list). Importantly, both TDW and SDT are designed as flexible, plug-and-play modules that can be integrated into existing DiT architectures.
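To make TDW concrete, below is a minimal PyTorch sketch of a timestep-conditioned width router. All names (`TDWRouter`, `mask_heads`) and the hard sigmoid threshold are illustrative assumptions, not the paper's actual implementation; DyDiT's real gating and training procedure may differ.

```python
import torch
import torch.nn as nn

class TDWRouter(nn.Module):
    """Maps the timestep embedding to binary activation masks over
    attention heads and MLP channel groups for the current timestep."""
    def __init__(self, t_dim: int, num_heads: int, num_groups: int):
        super().__init__()
        self.head_gate = nn.Linear(t_dim, num_heads)
        self.group_gate = nn.Linear(t_dim, num_groups)

    def forward(self, t_emb: torch.Tensor):
        # Hard thresholding for inference; training would need a
        # differentiable relaxation (e.g. a Gumbel-sigmoid).
        head_mask = (self.head_gate(t_emb).sigmoid() > 0.5).float()    # (B, H)
        group_mask = (self.group_gate(t_emb).sigmoid() > 0.5).float()  # (B, G)
        return head_mask, group_mask

def mask_heads(attn_out: torch.Tensor, head_mask: torch.Tensor) -> torch.Tensor:
    """Zero out deactivated heads. attn_out: (B, H, N, d), head_mask: (B, H).
    A real implementation would slice the weight matrices instead, so that
    skipped heads cost no FLOPs rather than being multiplied by zero."""
    return attn_out * head_mask[:, :, None, None]
```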
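Similarly, here is a hedged sketch of SDT-style token routing, assuming a simple per-token scoring head (`SDTGate` is a hypothetical name, not from the paper). Bypassed tokens pass through unchanged via the residual connection:

```python
import torch
import torch.nn as nn

class SDTGate(nn.Module):
    """Per-token router: tokens scored as 'easy' skip the MLP entirely
    and are carried forward unchanged by the residual path."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, mlp: nn.Module) -> torch.Tensor:
        # x: (B, N, C) sequence of image tokens
        keep = self.score(x).sigmoid().squeeze(-1) > 0.5  # (B, N) boolean mask
        out = x.clone()
        # Only selected tokens are routed through the MLP; an optimized
        # implementation would gather them into a dense batch first.
        out[keep] = x[keep] + mlp(x[keep])  # residual update for kept tokens
        return out

# Example usage with a standard transformer MLP:
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
gate = SDTGate(768)
tokens = torch.randn(2, 256, 768)  # batch of 2 images, 16x16 patches each
out = gate(tokens, mlp)
```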
Experimental Results and Implications
Extensive experimentation verifies the efficacy of DyDiT. Compared to DiT-XL, DyDiT-XL reduces FLOPs by 51% and accelerates generation by 1.73× while maintaining a strong FID of 2.07 on ImageNet. (Note that the measured 1.73× speedup is below the roughly 1/(1 − 0.51) ≈ 2.04× that the FLOPs reduction alone would suggest, which is typical once the runtime overhead of dynamic routing and memory access is accounted for.) These results indicate that DyDiT accelerates image generation without a significant loss in quality, and further experiments show the approach remains effective at larger model scales.
Compatibility with Other Methods
DyDiT's dynamic architecture is complementary to sampler-based efficient methods such as DPM-Solver++ and global acceleration techniques such as DeepCache: those methods reduce the number of denoising steps or reuse computation across steps, while DyDiT reduces the cost of each step, so the savings largely compound. This orthogonality suggests DyDiT can be combined with a wide range of acceleration approaches, delivering practical efficiency gains beyond what any single method achieves alone.
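A back-of-the-envelope illustration of why the two kinds of savings multiply; the step counts below are assumed purely for illustration, and only the 51% per-step FLOPs reduction comes from the paper's reported results:

```python
# A fast sampler such as DPM-Solver++ shrinks the NUMBER of denoising
# steps, while TDW/SDT shrink the cost of EACH step, so the two
# theoretical savings multiply.
baseline_steps = 250        # assumed DDPM-style step count
fast_sampler_steps = 20     # assumed DPM-Solver++-style schedule
per_step_cost = 1.0         # relative FLOPs of static DiT per step
dydit_per_step_cost = 0.49  # ~51% FLOPs reduction reported for DyDiT-XL

combined = (baseline_steps * per_step_cost) / (fast_sampler_steps * dydit_per_step_cost)
print(f"combined theoretical speedup: {combined:.1f}x")  # ~25.5x
```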
Future Directions
Given DyDiT's promising results, further exploration of high-resolution image generation, video generation, and text-to-image generation could extend its utility across diverse domains. Developing dynamic training strategies for distillation-based samplers is another avenue for further efficiency gains. More broadly, DyDiT's principles could be applied wherever computational efficiency is paramount but quality cannot be compromised, such as real-time video processing and other multimedia content generation.
This work motivates a shift towards more resource-efficient generative models by challenging the assumption that a static architecture is necessary throughout the diffusion process. DyDiT's plug-and-play modules offer a practical path to cutting computational cost in an increasingly demanding landscape.