EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching (2410.23788v1)

Published 31 Oct 2024 in cs.CV and cs.AI

Abstract: Transformer-based Diffusion Probabilistic Models (DPMs) have shown greater potential than CNN-based DPMs, yet their heavy computational requirements hinder widespread practical application. To reduce the computational budget of transformer-based DPMs, this work proposes the Efficient Diffusion Transformer (EDT) framework. The framework comprises a lightweight diffusion model architecture and a training-free Attention Modulation Matrix, arranged in an alternating pattern across EDT and inspired by human-like sketching. In addition, it introduces a token relation-enhanced masking training strategy tailored to EDT that strengthens its ability to learn token relations. Extensive experiments demonstrate the efficacy of EDT: the framework reduces both training and inference costs while surpassing existing transformer-based diffusion models in image synthesis quality. At lower FID, EDT-S, EDT-B, and EDT-XL achieve training speed-ups of 3.93x, 2.84x, and 1.92x, and inference speed-ups of 2.29x, 2.29x, and 2.22x, respectively, compared with the corresponding MDTv2 model sizes. The source code is released at https://github.com/xinwangChen/EDT.
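
The abstract's central efficiency idea is a training-free Attention Modulation Matrix that reweights self-attention without adding learned parameters. The sketch below illustrates one way such a modulation could be applied to attention scores; the specific locality-bias matrix, the helper names, and the tensor shapes are assumptions made for demonstration and are not taken from the EDT paper or its released code.

```python
# Illustrative sketch (not the authors' implementation): applying a training-free
# modulation matrix to self-attention scores. The distance-based locality bias
# used here is an assumed stand-in for EDT's Attention Modulation Matrix.
import torch
import torch.nn.functional as F

def locality_modulation(num_tokens: int, strength: float = 1.0) -> torch.Tensor:
    """Hypothetical modulation: down-weight attention between distant tokens."""
    idx = torch.arange(num_tokens)
    dist = (idx[None, :] - idx[:, None]).abs().float()
    return -strength * dist / num_tokens  # additive bias, applied before softmax

def modulated_attention(q, k, v, modulation: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, heads, tokens, dim); modulation: (tokens, tokens)
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale + modulation  # training-free reweighting
    return F.softmax(scores, dim=-1) @ v

# Example usage with random tensors
b, h, n, d = 2, 8, 256, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
out = modulated_attention(q, k, v, locality_modulation(n))
print(out.shape)  # torch.Size([2, 8, 256, 64])
```

Per the abstract, EDT alternates the arrangement of these modulation matrices across blocks (mimicking how sketching alternates between global structure and local detail); this single-function sketch does not reproduce that arrangement.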

Authors (5)
  1. Xinwang Chen (2 papers)
  2. Ning Liu (199 papers)
  3. Yichen Zhu (51 papers)
  4. Feifei Feng (23 papers)
  5. Jian Tang (327 papers)
Citations (1)
