
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (2403.03206v1)

Published 5 Mar 2024 in cs.CV

Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

Introduction to Rectified Flow Models

Rectified Flow (RF) models have recently emerged as a potent approach to generative modeling, distinguished by their conceptual simplicity and promising theoretical properties. These models formulate generation as traversing a straight path between data and noise, which in theory should simplify training and improve sampling efficiency. Despite this potential, RF models have not yet seen widespread adoption or large-scale validation, particularly in text-to-image synthesis. This paper addresses that gap by introducing techniques that leverage RF models for high-resolution image generation, in conjunction with a new architecture and data-preprocessing methods.
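The straight-path formulation can be made concrete with a minimal, simulation-free training sketch: interpolate linearly between a data sample and Gaussian noise, and regress the constant velocity along that line. Function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def rf_training_step(x0, rng):
    """Build one rectified-flow regression target (sketch).

    x0: batch of data samples, shape (batch, dim).
    Returns the noised interpolant z_t, the timesteps t, and the
    velocity target the network would be trained to predict.
    """
    eps = rng.standard_normal(x0.shape)       # noise endpoint of the path
    t = rng.uniform(size=(x0.shape[0], 1))    # one timestep per example
    z_t = (1.0 - t) * x0 + t * eps            # straight-line interpolant
    v_target = eps - x0                       # constant velocity along the line
    return z_t, t, v_target
```

Because the path is a straight line, the data point can be recovered exactly from any point on it as `z_t - t * v_target`, which is what makes the objective simulation-free.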

Enhanced Noise Sampling in RF Models

The paper improves noise sampling for RF models by biasing the training-time timestep distribution toward perceptually relevant scales. Extensive experiments demonstrate that this re-weighted objective significantly outperforms established diffusion formulations for text-to-image synthesis. By optimizing noise sampling, the work achieves superior performance in generating high-fidelity images, marking a step forward in the practical application of RF models.
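One way to bias timesteps toward the perceptually important middle of the trajectory is to sample from a logit-normal distribution, one of the densities the paper studies. The sketch below assumes illustrative location/scale parameters; the paper tunes these empirically.

```python
import numpy as np

def sample_timesteps_logit_normal(n, loc=0.0, scale=1.0, rng=None):
    """Draw n timesteps in (0, 1) from a logit-normal distribution.

    A Gaussian is sampled in logit space, then squashed through a sigmoid,
    concentrating probability mass near t = 0.5 (for loc = 0) rather than
    spreading it uniformly over [0, 1].
    """
    if rng is None:
        rng = np.random.default_rng(0)
    u = rng.normal(loc, scale, size=n)    # Gaussian in logit space
    return 1.0 / (1.0 + np.exp(-u))       # sigmoid maps into (0, 1)
```

With `loc = 0`, the resulting distribution is symmetric around 0.5, so intermediate noise levels are seen more often during training than the extremes near pure data or pure noise.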

Novel Architectural Contributions

The key architectural contribution of this research is a transformer-based model that maintains separate weight streams for the text and image modalities. A shared attention operation then enables a bidirectional exchange of information between text and image tokens, improving the model's comprehension and rendering of textual descriptions. The design also exhibits predictable scaling behavior: lower validation loss correlates directly with better text-to-image synthesis quality, as assessed through a variety of metrics and human evaluations.
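The two-stream design can be sketched as a joint attention step: each modality has its own projection weights, but queries, keys, and values are concatenated into one sequence so attention flows in both directions. This is a simplified sketch only; the full block also includes per-modality layer norms, MLPs, and timestep modulation, which are omitted here.

```python
import numpy as np

def joint_attention(img, txt, Wq, Wk, Wv):
    """Single-head joint attention over two weight streams (sketch).

    img: image tokens, shape (n_img, d); txt: text tokens, shape (n_txt, d).
    Wq/Wk/Wv: dicts with separate (d, d) projection matrices per modality,
    e.g. Wq = {"img": ..., "txt": ...}.
    """
    # Project each modality with its own weights, then concatenate.
    q = np.concatenate([img @ Wq["img"], txt @ Wq["txt"]], axis=0)
    k = np.concatenate([img @ Wk["img"], txt @ Wk["txt"]], axis=0)
    v = np.concatenate([img @ Wv["img"], txt @ Wv["txt"]], axis=0)
    # Scaled dot-product attention over the joint sequence.
    a = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(a - a.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    out = a @ v
    # Split the result back into the two streams.
    n_img = img.shape[0]
    return out[:n_img], out[n_img:]
```

Because both token sets attend over the joint sequence, text tokens are updated by image content and vice versa, which is the bidirectional information flow described above.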

Large-Scale Evaluation and Findings

In a comprehensive study, the proposed methods are evaluated extensively against state-of-the-art models. The findings indicate that the new RF models set new benchmarks in high-resolution text-to-image generation, outperforming existing models in both quantitative evaluations and human preference ratings. The research also provides a systematic comparison of diffusion-model and RF formulations, identifying the most effective strategies for text-to-image synthesis.

Moreover, the work explores simulation-free training objectives for RF models that are practical and reliable. It also addresses the challenge of operating efficiently across varying resolutions and aspect ratios, presenting adaptable positional encodings and timestep adjustments based on resolution scaling.
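The resolution-dependent timestep adjustment can be illustrated with a shift of the schedule: at higher resolutions (more latent tokens), the same timestep corresponds to a perceptually lower noise level, so timesteps are warped toward noisier values. The specific functional form below, with the shift factor taken as the square root of the token-count ratio, is one common parameterization and should be treated as a sketch rather than the paper's exact recipe.

```python
import math

def shift_timestep(t, n_base, n_target):
    """Warp a timestep t in [0, 1] when moving from a grid with n_base
    tokens to one with n_target tokens (sketch).

    s > 1 pushes intermediate timesteps toward the noise end, compensating
    for the weaker effect of a given noise level at higher resolution.
    Endpoints t = 0 and t = 1 are preserved.
    """
    s = math.sqrt(n_target / n_base)          # shift factor (assumed form)
    return s * t / (1.0 + (s - 1.0) * t)
```

For example, quadrupling the token count gives `s = 2`, so a mid-schedule timestep `t = 0.5` maps to about 0.667, i.e. a noisier point on the trajectory.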

Implications and Future Prospects

This research holds significant implications for the advancement of generative models, reinforcing the viability of RF models for complex, high-dimensional tasks like text-to-image synthesis. By pushing the boundaries of RF model performance and scalability, the paper sets a foundation for future explorations that could further unlock the potential of these models.

The exploration of model scaling opens new avenues for generating images and videos with increasing fidelity and complexity, suggesting that further scaling and methodological refinements could yield even more impressive outcomes. Additionally, the flexible use of text encoders offers practical insights into managing computational resources while maintaining high performance, a critical consideration for deploying AI models at scale.

In conclusion, this paper not only advances our understanding of RF models and their application to text-to-image synthesis but also prompts a reevaluation of current generative model benchmarks. By addressing both theoretical and practical challenges, the research paves the way for future developments in AI-driven, high-resolution image synthesis.

Authors (17)
  1. Patrick Esser
  2. Sumith Kulal
  3. Andreas Blattmann
  4. Rahim Entezari
  5. Jonas Müller
  6. Harry Saini
  7. Yam Levi
  8. Dominik Lorenz
  9. Axel Sauer
  10. Frederic Boesel
  11. Dustin Podell
  12. Tim Dockhorn
  13. Zion English
  14. Kyle Lacey
  15. Alex Goodwin
  16. Yannik Marek
  17. Robin Rombach