U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers (2405.02730v3)

Published 4 May 2024 in cs.CV

Abstract: Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability; but meanwhile, the abandonment of U-Net by DiTs and their subsequent improvements is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-architectured DiT with an isotropic one. It turns out that the U-Net architecture only gains a slight advantage from the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) and conduct extensive experiments to demonstrate their extraordinary performance. The proposed U-DiT can outperform DiT-XL/2 with only 1/6 of its computation cost. Code is available at https://github.com/YuchuanTian/U-DiT.
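The core mechanism is self-attention over downsampled tokens. Below is a minimal PyTorch sketch of that idea, not the authors' implementation (see the linked repository for that): the 2x2 pixel-unshuffle split, the module name, and the head count are illustrative assumptions.

```python
# Hypothetical sketch of self-attention over downsampled tokens; not the
# official U-DiT code (see https://github.com/YuchuanTian/U-DiT).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Split an (H, W) token map into four 2x-downsampled sub-maps,
    attend within each sub-map, then merge back to full resolution.

    With N = H * W tokens, full attention costs O(N^2); four sub-maps
    of N/4 tokens cost 4 * O((N/4)^2) = O(N^2) / 4.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                      # h, w assumed even
        n4 = (h // 2) * (w // 2)                  # tokens per sub-map
        # Pixel-unshuffle: (B, C, H, W) -> (B, 4C, H/2, W/2); the channel
        # dim factors as (C, 4), so each of the 4 slices per channel is a
        # spatially downsampled copy of the token map.
        sub = F.pixel_unshuffle(x, 2).view(b, c, 4, n4)
        sub = sub.permute(0, 2, 3, 1).reshape(b * 4, n4, c)  # (4B, N/4, C)
        # Queries, keys, and values all come from the downsampled tokens.
        out, _ = self.attn(sub, sub, sub)
        # Invert the reshapes and pixel-shuffle back to (B, C, H, W).
        out = out.reshape(b, 4, n4, c).permute(0, 3, 1, 2)
        out = out.reshape(b, 4 * c, h // 2, w // 2)
        return F.pixel_shuffle(out, 2)

# Example: a 16x16 latent token map with 64 channels round-trips in shape.
y = DownsampledSelfAttention(dim=64)(torch.randn(2, 64, 16, 16))
assert y.shape == (2, 64, 16, 16)
```

Because attention cost is quadratic in token count, attending within four quarter-size sub-maps cuts attention compute to roughly a quarter of full-resolution attention, consistent with the abstract's claim that downsampling improves results despite a considerable reduction in computation.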

Authors (6)
  1. Yuchuan Tian
  2. Zhijun Tu
  3. Hanting Chen
  4. Jie Hu
  5. Chao Xu
  6. Yunhe Wang
