GenTron: Diffusion Transformers for Image and Video Generation (2312.04557v2)

Published 7 Dec 2023 in cs.CV

Abstract: In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.

The paper presents an in-depth study of transformer-based diffusion models for text-conditioned image and video synthesis. The work builds on prior transformer diffusion architectures by extending them beyond class-conditioned scenarios to free-form text conditioning and by scaling model capacity significantly. Detailed empirical evaluations underscore the advantages of transformer designs over traditional convolutional U-Net backbones for generative modeling, with particular emphasis on compositionality and spatial accuracy.

The paper addresses several key aspects:

  • Text Conditioning Mechanisms:

The paper transitions from the conventional class-based conditioning used in earlier diffusion transformers to free-form text conditioning. It rigorously investigates different conditioning strategies, comparing adaptive layer normalization (adaLN) with cross-attention. The results indicate that while adaLN is effective for compact, global conditioning signals such as class labels, cross-attention is markedly superior for handling the spatially heterogeneous information present in natural-language descriptions. The authors also perform a comparative study of text encoders, including multimodal encoders such as CLIP and pure language models such as Flan-T5, as well as their combination, to leverage complementary strengths in providing robust textual guidance.
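
To make the distinction concrete, below is a minimal PyTorch sketch of the two conditioning paths; the module names and dimensions are illustrative assumptions, not the paper's implementation. adaLN modulates all image tokens with a single pooled conditioning vector, whereas cross-attention lets each image token attend to the full sequence of text-token embeddings.

```python
import torch
import torch.nn as nn


class AdaLNConditioning(nn.Module):
    """Adaptive layer norm: a pooled conditioning vector produces per-channel
    scale and shift, so every spatial token receives the same global signal."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, pooled_cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) image tokens, pooled_cond: (B, C) pooled text embedding
        scale, shift = self.to_scale_shift(pooled_cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class CrossAttnConditioning(nn.Module):
    """Cross-attention: image tokens query the full sequence of text-token
    embeddings, so spatially distinct prompt content can steer distinct regions."""

    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) image tokens, text_tokens: (B, L, C) per-token text embeddings
        attended, _ = self.attn(query=x, key=text_tokens, value=text_tokens)
        return x + attended
```

The practical difference is that adaLN compresses the prompt into one vector per example, while cross-attention preserves token-level text information, which matches the reported advantage of cross-attention for free-form prompts.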

  • Scaling Transformer-Based Diffusion:

A core contribution is the exploration of scaling transformer-based diffusion models. The baseline model, derived from DiT, is scaled from approximately 900 million parameters to over 3 billion parameters by increasing the model depth (number of transformer blocks), width (embedding dimensions), and MLP hidden dimensions. Quantitative evaluations on compositional benchmarks highlight consistent improvements across metrics such as attribute binding (color, shape, texture) and object relationships. Qualitative comparisons also reveal significant refinement in the spatial layout and fine details, particularly for complex prompts.
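
As a rough illustration of how the three scaling knobs interact, the sketch below estimates transformer parameter counts from depth, width, and MLP hidden size; the configurations and the estimate itself are assumptions for illustration only, not GenTron's exact hyperparameters (embedding, conditioning, and normalization parameters are ignored).

```python
def approx_transformer_params(depth: int, width: int, mlp_hidden: int) -> int:
    """Very rough per-block estimate: QKV plus output projections (~4 * width^2)
    and a two-layer MLP (2 * width * mlp_hidden), summed over all blocks."""
    attn_params = 4 * width * width
    mlp_params = 2 * width * mlp_hidden
    return depth * (attn_params + mlp_params)


# Illustrative (assumed) configurations: increasing depth, width, and MLP hidden
# size together grows the parameter count rapidly.
for name, depth, width, mlp_hidden in [("smaller", 28, 1152, 4608),
                                       ("larger", 56, 2048, 8192)]:
    params = approx_transformer_params(depth, width, mlp_hidden)
    print(f"{name}: ~{params / 1e9:.2f}B parameters (crude estimate)")
```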

  • Extension to Text-to-Video Generation:

The authors extend the framework to video synthesis by incorporating temporal modeling within each transformer block. A lightweight temporal self-attention layer is interleaved between the cross-attention and MLP modules, and the model is further adapted via joint image-video training. A novel concept called motion-free guidance (MFG) is introduced. Inspired by classifier-free guidance, MFG intermittently replaces the temporal self-attention mask with an identity matrix with a preset probability. This effectively disables motion modeling during selected training steps, thereby preserving per-frame visual quality while still capturing coherent temporal dynamics during video generation. The inference process also involves a modified score estimate where distinct guidance scales control text and motion conditions independently.
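
The two mechanisms described above can be sketched as follows. The mask helper mirrors the idea of replacing the temporal self-attention mask with an identity matrix at a preset probability, and the guidance function illustrates one plausible way to apply separate scales for the text and motion conditions; the exact weighting used in the paper may differ, and all names here are hypothetical.

```python
import torch


def temporal_attention_mask(num_frames: int, motion_free_prob: float = 0.1) -> torch.Tensor:
    """With probability motion_free_prob, return an identity mask so each frame
    attends only to itself (motion modeling disabled for this step); otherwise
    allow full temporal self-attention across frames."""
    if torch.rand(()) < motion_free_prob:
        return torch.eye(num_frames, dtype=torch.bool)
    return torch.ones(num_frames, num_frames, dtype=torch.bool)


def guided_noise_estimate(eps_uncond: torch.Tensor,
                          eps_text: torch.Tensor,
                          eps_text_motion: torch.Tensor,
                          text_scale: float = 7.5,
                          motion_scale: float = 1.2) -> torch.Tensor:
    """Classifier-free-guidance-style combination with two independent scales:
    one pushes toward the text condition, the other toward the motion-enabled
    condition. This is an assumed decomposition, not GenTron's verbatim formula."""
    text_direction = eps_text - eps_uncond
    motion_direction = eps_text_motion - eps_text
    return eps_uncond + text_scale * text_direction + motion_scale * motion_direction
```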

  • Empirical and Comparative Evaluation:

Extensive experiments demonstrate that the cross-attention mechanism and the joint use of CLIP and T5-based text embeddings yield superior performance compared to alternative approaches. The scaled model (over 3B parameters) shows robust improvements on benchmarks like T2I-CompBench, with a significant margin in compositional tasks. Human studies reveal that the proposed method attains higher win rates in visual quality and text alignment compared to competitive baselines such as SDXL. The paper also provides detailed ablations showing that motion-free guidance leads to enhanced focus on the key objects mentioned in prompts and improved overall video quality.

  • Implementation Details and Training Strategies:

The training procedure is multi-staged, starting with low-resolution image generation and progressively moving to higher resolutions. The methodology uses advanced parallelization techniques like Fully Sharded Data Parallel (FSDP) and activation checkpointing to manage the increased computational requirements of the larger model. The video training leverages a large-scale dataset with joint image-video samples to counterbalance the limited quality and quantity of available video data. Furthermore, the paper details theoretical formulations of the diffusion process, including equations for the forward and reverse diffusion steps, and the incorporation of temporal self-attention, providing mathematical clarity on the underlying processes.
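
For reference, the forward and reverse processes that such diffusion models build on can be written in standard DDPM notation (this is the generic formulation, not copied from the paper):

```latex
% Forward (noising) process with variance schedule \beta_t:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\bigr),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s).

% Reverse (denoising) process with a conditional noise-prediction network \epsilon_\theta:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\bigl(x_{t-1};\ \mu_\theta(x_t, t, c),\ \Sigma_\theta(x_t, t)\bigr),
\qquad
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\bigl[\lVert \epsilon - \epsilon_\theta(x_t, t, c)\rVert^2\bigr].
```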

Overall, the paper offers a comprehensive exploration of how transformer architectures can be effectively integrated with diffusion models for both image and video generation. It not only illustrates the scalability advantages of transformer-based models in this domain but also pioneers strategies, such as motion-free guidance, that mitigate inherent challenges in video synthesis while maintaining high visual fidelity and compositional consistency.

References (70)
  1. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  2. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  3. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.
  4. Improving image generation with better captions, 2023.
  5. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  6. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  9. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023b.
  10. PaLI: A jointly-scaled multilingual language-image model. In The Eleventh International Conference on Learning Representations, 2023c.
  11. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  12. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  13. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  14. Jointly trained image and video generation using residual vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3028–3042, 2020.
  15. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
  16. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  17. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
  18. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  19. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  20. Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023.
  21. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  22. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022a.
  23. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022b.
  24. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  25. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  26. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  27. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022b.
  28. Video diffusion models, 2022c.
  29. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  30. OpenCLIP, 2021.
  31. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  32. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439, 2023.
  33. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  34. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  35. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  36. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  37. Vdt: An empirical study on video diffusion with transformers. arXiv preprint arXiv:2305.13311, 2023.
  38. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
  39. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  40. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  41. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022.
  42. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, 2018.
  43. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  44. Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535, 2023.
  45. Improving language understanding by generative pre-training. OpenAI blog, 2018.
  46. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  47. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  48. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  49. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
  50. Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. In International Conference on Learning Representations, 2021.
  51. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  52. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  53. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  54. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023.
  55. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  56. Deepfloyd if: A novel state-of-the-art open-source text-to-image model. https://github.com/deep-floyd/IF, 2023.
  57. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023.
  58. Codefusion: A pre-trained diffusion model for code generation. arXiv preprint arXiv:2310.17680, 2023.
  59. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  60. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  61. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  62. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  63. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023b.
  64. Dolfin: Diffusion layout transformers without autoencoder. arXiv preprint arXiv:2310.16305, 2023c.
  65. RAPHAEL: Text-to-image generation via large mixture of diffusion paths. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  66. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022. Featured Certification.
  67. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12104–12113, 2022.
  68. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
  69. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  70. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
Authors (10)
  1. Shoufa Chen (22 papers)
  2. Mengmeng Xu (27 papers)
  3. Jiawei Ren (33 papers)
  4. Yuren Cong (11 papers)
  5. Sen He (29 papers)
  6. Yanping Xie (3 papers)
  7. Animesh Sinha (14 papers)
  8. Ping Luo (340 papers)
  9. Tao Xiang (324 papers)
  10. Juan-Manuel Perez-Rua (23 papers)
Citations (22)