U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers (2405.02730v3)
Abstract: Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; meanwhile, the abandonment of the U-Net by DiTs and their follow-up improvements is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-architectured DiT with an isotropic one. It turns out that the U-Net architecture gains only a slight advantage from the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) in this paper and conduct extensive experiments to demonstrate the extraordinary performance of U-DiT models. The proposed U-DiT can outperform DiT-XL/2 with only 1/6 of its computation cost. Code is available at https://github.com/YuchuanTian/U-DiT.
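For concreteness, below is a minimal PyTorch sketch of the token-downsampling idea described in the abstract: the token grid is split into 2x-downsampled sub-grids before attention, self-attention runs on the shorter sequences, and the outputs are merged back to the original layout. The module name, the 2x2 grid-splitting scheme, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than the authors' implementation; see the repository linked above for the official code.

```python
# Hedged sketch of self-attention on downsampled tokens (not the official U-DiT code).
import torch
import torch.nn as nn


class DownsampledSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w tokens arranged on an h x w grid.
        b, n, c = x.shape
        assert n == h * w and h % 2 == 0 and w % 2 == 0

        # Split the token grid into four 2x-downsampled sub-grids
        # (pixel-unshuffle-like): each sub-grid holds N/4 tokens.
        grid = x.view(b, h // 2, 2, w // 2, 2, c)
        subs = grid.permute(0, 2, 4, 1, 3, 5).reshape(b * 4, (h // 2) * (w // 2), c)

        # Attention runs on sequences of length N/4, so the quadratic
        # attention cost per sub-grid drops roughly by a factor of four.
        subs = self.norm(subs)
        out, _ = self.attn(subs, subs, subs)

        # Merge the four sub-grids back into the original h x w token layout.
        out = out.view(b, 2, 2, h // 2, w // 2, c)
        out = out.permute(0, 3, 1, 4, 2, 5).reshape(b, n, c)
        return out


if __name__ == "__main__":
    tokens = torch.randn(2, 16 * 16, 256)  # e.g. 16x16 latent tokens, 256 channels
    block = DownsampledSelfAttention(dim=256)
    print(block(tokens, h=16, w=16).shape)  # torch.Size([2, 256, 256])
```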
Authors: Yuchuan Tian, Zhijun Tu, Hanting Chen, Jie Hu, Chao Xu, Yunhe Wang