
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (2403.03206v1)

Published 5 Mar 2024 in cs.CV

Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.



Introduction to Rectified Flow Models

Rectified flow (RF) models are a recent generative formulation that connects data and noise along a straight path, which in theory should simplify training and make sampling more efficient. Despite these properties, RF models have not yet been decisively validated at large scale, particularly for text-to-image synthesis. This paper addresses that gap with techniques for high-resolution image generation that combine improved RF training with a new architecture and data preprocessing methods.
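The straight-path idea can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the common convention that t = 0 is the data sample and t = 1 is pure noise, and the function names (`interpolate`, `velocity_target`, `rf_loss`) are hypothetical.

```python
def interpolate(x0, noise, t):
    """Point on the straight line between data x0 (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity along the straight path: d z_t / dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def rf_loss(predict_velocity, x0, noise, t):
    """Mean squared error between a predicted velocity and the straight-path target.

    predict_velocity is any callable (z_t, t) -> list of floats; in practice it
    would be a neural network (assumption for illustration).
    """
    z_t = interpolate(x0, noise, t)
    pred = predict_velocity(z_t, t)
    target = velocity_target(x0, noise)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)
```

Because the path is a straight line, the regression target does not depend on t, which is one reason the formulation is described as conceptually simple.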

Enhanced Noise Sampling in RF Models

The paper improves the noise sampling used to train RF models by biasing it towards perceptually relevant scales. In a large-scale study, this re-weighted approach outperforms established diffusion formulations for high-resolution text-to-image synthesis.
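One way to bias timestep sampling away from the extremes is a logit-normal distribution, which is among the samplers studied in the paper. The sketch below is illustrative only: the function name and the default `mean` and `std` values are assumptions, not the settings used by the authors.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid.

    With mean=0 the density peaks at t=0.5, so intermediate noise levels are
    sampled more often than with a uniform distribution. The defaults here
    are placeholders chosen for illustration.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def fraction_in_middle(samples, low=0.25, high=0.75):
    """Share of samples that fall in the central band [low, high]."""
    inside = sum(1 for t in samples if low <= t <= high)
    return inside / len(samples)
```

Comparing `fraction_in_middle` for these samples against uniform draws shows the re-weighting: a uniform sampler places about half its mass in [0.25, 0.75], while the logit-normal sampler with these placeholder settings places noticeably more there.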

Novel Architectural Contributions

The paper also introduces a transformer-based architecture that uses separate weights for the text and image modalities while enabling a bidirectional flow of information between image and text tokens. This design improves text comprehension, typography, and human preference ratings. The architecture follows predictable scaling trends, and lower validation loss correlates with better text-to-image synthesis as measured by various metrics and human evaluations.
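The combination of separate weights with a two-way exchange of information can be sketched as a single attention step over the concatenated token sequence. This is a toy, single-head illustration in plain Python under stated assumptions: the helper names, the dictionary layout of the weights, and the absence of normalization, multiple heads, and output projections are all simplifications, not the paper's implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random matrix, used only to build toy inputs and weights."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    """Attention over text and image tokens with modality-specific projections.

    text_w and image_w are dicts with keys "q", "k", "v" (hypothetical layout).
    Each modality is projected with its own weights, then the sequences are
    concatenated so every token can attend to tokens of both modalities.
    """
    q = matmul(text_tokens, text_w["q"]) + matmul(image_tokens, image_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(image_tokens, image_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(image_tokens, image_w["v"])
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    out = matmul(weights, v)
    n_text = len(text_tokens)
    # Split the joint output back into the two modality streams.
    return out[:n_text], out[n_text:]
```

Because attention runs over the joint sequence, changing the image tokens changes the text outputs and vice versa, which is the bidirectional flow described above.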

Large-Scale Evaluation and Findings

In a comprehensive study, the proposed methods are evaluated against state-of-the-art models. The largest of the new RF models outperform existing models for high-resolution text-to-image generation in both quantitative evaluations and human preference ratings. The research also systematically compares different diffusion and RF formulations to identify the most effective ones for text-to-image synthesis.

Moreover, the work explores simulation-free training methodologies for RF models, presenting practical and reliable objectives. It addresses the challenge of formulating a generative model that operates efficiently across varying resolutions and aspect ratios, presenting an adaptable approach to positional encoding and timestep adjustments based on resolution scaling.
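A resolution-dependent timestep adjustment can be illustrated with a simple shift function. The sketch below assumes the shift takes the form t' = a t / (1 + (a - 1) t) with a = sqrt(target_pixels / base_pixels); treat both the exact form and the names `shift_timestep`, `base_pixels`, and `target_pixels` as assumptions made for illustration rather than a definitive statement of the paper's method.

```python
import math


def shift_timestep(t, base_pixels, target_pixels):
    """Map a timestep t in [0, 1] chosen for base_pixels to target_pixels.

    For a larger target resolution (a > 1) the result moves towards 1,
    meaning more noise is applied at the same nominal timestep.
    The endpoints t=0 and t=1 are left unchanged.
    """
    a = math.sqrt(target_pixels / base_pixels)
    return a * t / (1.0 + (a - 1.0) * t)
```

For example, with a 256 x 256 base and a 1024 x 1024 target, a = 4, so `shift_timestep(0.5, 256 * 256, 1024 * 1024)` evaluates 4 * 0.5 / (1 + 3 * 0.5) = 0.8. The intuition is that higher-resolution images need more noise to destroy the same amount of signal.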

Implications and Future Prospects

This research holds significant implications for the advancement of generative models, reinforcing the viability of RF models for complex, high-dimensional tasks like text-to-image synthesis. By pushing the boundaries of RF model performance and scalability, the paper sets a foundation for future explorations that could further unlock the potential of these models.

The exploration of model scaling opens new avenues for generating images and videos with increasing fidelity and complexity, suggesting that further scaling and methodological refinements could yield even more impressive outcomes. Additionally, the flexible use of text encoders offers practical insights into managing computational resources while maintaining high performance, a critical consideration for deploying AI models at scale.

In conclusion, this paper not only advances our understanding of RF models and their application to text-to-image synthesis but also prompts a reevaluation of current generative model benchmarks. By addressing both theoretical and practical challenges, the research paves the way for future developments in AI-driven, high-resolution image synthesis.


Authors

Patrick Esser
Sumith Kulal
Andreas Blattmann
Rahim Entezari
Jonas Müller
Harry Saini
Yam Levi
Dominik Lorenz
Axel Sauer
Frederic Boesel
Dustin Podell
Tim Dockhorn
Zion English
Kyle Lacey
Alex Goodwin
Yannik Marek
Robin Rombach
