
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation (2309.06380v2)

Published 12 Sep 2023 in cs.LG and cs.CV

Abstract: Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve the sampling speed and reduce computational costs through distillation have been unsuccessful in achieving a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which, thus far, has only been applied to small datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates the distillation process with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find reflow plays a critical role in improving the assignment between noise and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an expanded network with 1.7B parameters, we further improve the FID to $22.4$. We call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow yields an FID of $13.1$ in just $0.09$ seconds, the best in the $\leq 0.1$ second regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ seconds). Notably, training InstaFlow only costs 199 A100 GPU days. Codes and pre-trained models are available at \url{github.com/gnobitab/InstaFlow}.

Authors (5)
  1. Xingchao Liu (28 papers)
  2. Xiwen Zhang (27 papers)
  3. Jianzhu Ma (48 papers)
  4. Jian Peng (101 papers)
  5. Qiang Liu (405 papers)
Citations (131)

Summary

Overview of "InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation"

The paper presents a substantial advancement in diffusion-based text-to-image generation, introducing InstaFlow, an efficient one-step model derived from Stable Diffusion. By leveraging the Rectified Flow framework and its reflow procedure, the authors address the notorious challenge of slow multi-step inference in diffusion models.

Key Contributions

The research investigates the potential of accelerating text-to-image generation through the Rectified Flow methodology. Here are the primary highlights:

  1. One-Step Model Efficiency: InstaFlow refines the typical multi-step diffusion model to achieve comparable image quality in just one inference step, marking a significant reduction in computational demands.
  2. Numerical Performance: The model achieves an FID score of 23.3 on the MS COCO 2017-5k dataset, outperforming progressive distillation by a significant margin (37.2 → 23.3 in FID), with a total training cost of only 199 A100 GPU days.
  3. Scalability: By expanding the neural network size to 1.7B parameters, the authors further lowered the FID to 22.4, showcasing the model's enhanced consistency and quality.
  4. Practical Applications: InstaFlow demonstrates its utility in quickly generating high-quality previews for later refinement by more computationally expensive models like SDXL-Refiner, providing a pragmatic workflow for real-world applications.
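
The one-step efficiency claim rests on a property of rectified flows: once the probability-flow trajectory is straight, a single Euler step traverses it exactly, so the usual tens of solver steps collapse to one. A minimal toy sketch of this idea (illustrative only, not the authors' code; the 1-D data point and velocity field are hypothetical):

```python
import numpy as np

# Toy straight (rectified) flow: the velocity v(x, t) = x1 - x0 is
# constant along the trajectory, so one Euler step from noise lands
# on the same endpoint as a fine-grained multi-step ODE solve.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)          # "noise" sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])  # toy "image" endpoint

def velocity(x, t):
    # Straight-line flow: velocity is independent of x and t
    return x1 - x0

def euler_solve(x, n_steps):
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + velocity(x, t) * dt
        t += dt
    return x

many = euler_solve(x0, 100)  # conventional multi-step sampling
one = euler_solve(x0, 1)     # one-step sampling (InstaFlow regime)
print(np.allclose(many, one))  # True: straight flow needs only one step
```

For a curved trajectory (as in the original Stable Diffusion flow), the one-step Euler result would diverge from the multi-step solution; reflow is what pushes trajectories toward this straight regime.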

Methodology

The authors employ and expand the Rectified Flow approach, which enhances the coupling of noise and image distributions:

  • Reflow Procedure: This process straightens the probability flow trajectories, allowing for more accurate, faster simulations with fewer inference steps.
  • Text-Conditioned Refinement: The model integrates textual conditioning to refine trajectories and improve semantic alignment between textual prompts and generated images.
  • Distillation: After reflow straightens the flow and improves the noise-image coupling, a one-step student model is distilled from the rectified teacher, accurately approximating the original high-quality diffusion outputs in a single step.
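
The reflow objective in the bullets above can be written compactly: pair each noise sample z0 with its teacher endpoint z1, sample t uniformly, form the straight-line interpolation x_t = (1 - t) z0 + t z1, and regress a velocity model onto the constant displacement z1 - z0. A hedged toy sketch (a linear map stands in for the Stable Diffusion teacher; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pair noise z0 with teacher endpoints z1. In the paper these pairs come
# from simulating Stable Diffusion's probability-flow ODE; here a toy
# affine map stands in for the teacher.
z0 = rng.standard_normal((256, 2))
z1 = 0.5 * z0 + 1.0  # hypothetical teacher output

def reflow_loss(v_model, z0, z1, rng):
    """Rectified-flow objective on straight-line interpolations."""
    t = rng.uniform(size=(z0.shape[0], 1))
    xt = (1 - t) * z0 + t * z1     # point on the straight path
    target = z1 - z0               # constant straight-line velocity
    return np.mean((v_model(xt, t) - target) ** 2)

# A velocity field that inverts the toy path achieves (near-)zero loss:
def ideal_velocity(xt, t):
    z0_rec = (xt - t) / (1 - 0.5 * t)      # recover z0 from (xt, t)
    return (0.5 * z0_rec + 1.0) - z0_rec   # equals z1 - z0

print(reflow_loss(ideal_velocity, z0, z1, rng))  # ≈ 0

# Distillation then amounts to a single Euler step from the noise:
one_step = z0 + ideal_velocity(z0, np.zeros((z0.shape[0], 1)))
print(np.allclose(one_step, z1))  # True: one step reaches the endpoint
```

The design point is that the regression target z1 - z0 does not depend on t: a model that fits it everywhere along the path is constant-velocity by construction, which is exactly what makes the one-step Euler update exact.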

Implications and Future Directions

The implications of InstaFlow are profound for the field of AI, particularly in text-to-image synthesis:

  • Reduction in Resources: The ability to produce high-fidelity images rapidly with lower resource expenditure is a considerable advantage, enabling wider accessibility and scalability.
  • Streamlined Models: With its fast execution and high output quality, InstaFlow provides a template for refining other types of large-scale generative models.
  • Potential for Expansion: Future developments might explore further scaling, integration with other generative models like GANs, or hybrid techniques to push boundaries in image fidelity and creative control.

Ultimately, InstaFlow signifies a notable step forward in the quest to balance efficiency and quality in generative AI models, poised to impact various application domains ranging from creative industries to automated content generation systems.
