
On the Scalability of Diffusion-based Text-to-Image Generation (2404.02883v1)

Published 3 Apr 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for diffusion-based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets up to 600M images. For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. We then identify an efficient UNet variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency. Finally, we provide scaling functions to predict the text-image alignment performance as functions of the scale of model size, compute and dataset size.

References (42)
  1. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  2. All are worth words: A vit backbone for diffusion models. In CVPR, 2023.
  3. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
  4. Language models are few-shot learners. In NeurIPS, 2020.
  5. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024.
  6. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023.
  7. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  8. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  9. DeepFloyd IF. https://github.com/deep-floyd/IF, 2023.
  10. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  11. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  12. Masked diffusion transformer is a strong image synthesizer. 2023.
  13. Generative adversarial nets. In NeurIPS, 2014.
  14. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  15. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  16. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  17. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  18. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023.
  19. TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering. arXiv preprint arXiv:2303.11897, 2023.
  20. OpenCLIP, 2021.
  21. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  22. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  23. LAION. LAION Aesthetics Predictor V2. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.
  24. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  25. SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. arXiv preprint arXiv:2306.00980, 2023.
  26. Microsoft COCO: Common objects in context. In ECCV, 2014.
  27. Decoupled weight decay regularization. ICLR, 2019.
  28. Improved denoising diffusion probabilistic models. In ICML, 2021.
  29. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  30. Scalable diffusion models with transformers. In ICCV, 2023.
  31. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  32. Learning transferable visual models from natural language supervision. In ICML, 2021.
  33. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
  34. High-resolution image synthesis with latent diffusion models. In CVPR, 2021.
  35. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  36. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  37. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  38. Attention is all you need. NeurIPS, 2017.
  39. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  40. ImageReward: Learning and evaluating human preferences for text-to-image generation. In NeurIPS, 2023.
  41. Diffusion models without attention. arXiv preprint arXiv:2311.18257, 2023.
  42. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

Summary

  • The paper shows that the placement and amount of cross-attention distinguishes existing UNet designs, and it identifies UNet variants that retain strong performance as model size is scaled.
  • It employs empirical analysis by training models with 0.4B to 4B parameters and datasets up to 600M images to explore scaling dynamics.
  • Enhanced training data quality, particularly in caption diversity, significantly improves text-image alignment and informs efficient model scaling strategies.

On the Scalability of Diffusion-based Text-to-Image Generation

The paper examines the scalability of diffusion-based text-to-image (T2I) generation models, an aspect of diffusion models that has not been fully characterized. The authors study scaling properties empirically, evaluating both denoising backbones and training datasets. They compare different architectures, notably UNet and Transformer variants, and explore how scaling model parameters, compute, and data affects performance.

Key findings highlight how specific UNet designs outperform others with similar parameter sizes due to differences in cross-attention implementations and architectural designs. The paper provides an in-depth analysis of scaling dynamics by training models with varying parameters, from 0.4 billion to 4 billion, and datasets that span up to 600 million images. Notably, the paper identifies an efficient UNet model variant that is 45% smaller and 28% faster than SDXL’s UNet while retaining competitive performance. Enhanced training data is shown to correlate strongly with performance gains, emphasizing the significance of dataset quality over mere size.

Scaling model capacity is not straightforward: the placement and number of cross-attention layers strongly affect a UNet's performance, and increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than widening channels. On the data side, the authors find that caption density and diversity play crucial roles in improving text-image alignment and learning efficiency.
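
To make the depth-versus-width trade-off concrete, the following sketch counts parameters in a simplified UNet stage when transformer (self- plus cross-attention) blocks are added versus when the channel width is doubled. The block definitions, the channel width of 640, and the text-embedding width of 2048 are illustrative assumptions, not the exact configurations ablated in the paper.

```python
# Illustrative parameter accounting for one UNet stage that interleaves
# residual conv blocks with transformer (self- + cross-attention) blocks.
# Simplified sketch: real blocks also include norms, biases, and
# time-embedding projections, which are ignored here.

def transformer_block_params(width, text_dim=2048, ffn_mult=4):
    self_attn = 4 * width * width                           # q, k, v, out projections
    cross_attn = 2 * width * width + 2 * width * text_dim   # q/out from image, k/v from text
    ffn = 2 * ffn_mult * width * width                       # two linear layers
    return self_attn + cross_attn + ffn

def conv_block_params(width, kernel=3):
    return 2 * (kernel * kernel * width * width)             # two 3x3 convs at constant width

def stage_params(width, n_conv=2, n_transformer=2):
    return n_conv * conv_block_params(width) + n_transformer * transformer_block_params(width)

base = stage_params(width=640, n_transformer=2)
deeper = stage_params(width=640, n_transformer=4)    # add transformer depth
wider = stage_params(width=1280, n_transformer=2)    # double channel width

print(f"base   : {base / 1e6:6.1f} M params")
print(f"deeper : {deeper / 1e6:6.1f} M params (+{(deeper - base) / 1e6:.1f} M)")
print(f"wider  : {wider / 1e6:6.1f} M params (+{(wider - base) / 1e6:.1f} M)")
```

Under these assumptions, adding transformer depth grows the stage roughly linearly in the number of blocks, while doubling the width roughly quadruples per-block cost, which is consistent with the paper's observation that depth is the more parameter-efficient lever for alignment.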

The authors provide scaling functions that predict text-image alignment performance from model size, compute, and dataset size, informing strategies for more efficient scaling. They also examine transformer-based backbones (PixArt-α) as potential alternatives to the UNet, noting that while these show promise, they require substantial compute and pretraining to match the UNet's performance.
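
As a generic illustration of how such scaling functions can be fit, the snippet below regresses a saturating power law of alignment score against training compute. Both the functional form and the data points are placeholders; the paper's actual curves, metrics, and coefficients differ.

```python
# Generic recipe for fitting a saturating power-law scaling curve that predicts
# a text-image alignment score from training compute.
import numpy as np
from scipy.optimize import curve_fit

def alignment_vs_compute(c, a, b, alpha):
    """Saturating power law: the score approaches `a` as compute c grows."""
    return a - b * np.power(c, -alpha)

# Hypothetical (compute, alignment score) measurements for illustration only.
compute = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
score = np.array([0.62, 0.67, 0.71, 0.74, 0.76, 0.775])

params, _ = curve_fit(alignment_vs_compute, compute, score, p0=[0.8, 0.2, 0.5])
a, b, alpha = params
print(f"fit: score ≈ {a:.3f} - {b:.3f} * C^(-{alpha:.3f})")

# Extrapolate to a larger compute budget (treat predictions outside the fit range with caution).
print("predicted score at C = 128:", alignment_vs_compute(128.0, *params))
```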

This exploration of scaling behavior brings previously incomparable training setups into a common experimental framework, laying a foundation for future T2I diffusion models that are both computationally efficient and effective at text-image alignment. The large-scale, controlled comparison of different denoising backbones for text-to-image synthesis is itself a notable contribution toward more efficient training and inference.

Model size is not the whole story: expanding and improving the training dataset plays a pivotal role in enhancing the performance of both small and large models, underscoring that dataset scale and quality set a model's upper performance bound. As the diffusion model landscape develops, these insights could inform practical approaches for scaling models, potentially leading to significant improvements in generating coherent images from textual descriptions.
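
Since much of the data-side gain is attributed to caption density and diversity, the sketch below shows one way such a caption-augmentation pipeline could be organized: each image keeps its original alt-text and gains several machine-generated captions, and training samples among them. The `synthetic_captioner` callable and the 0.5 sampling ratio are assumptions for illustration, not choices documented in the paper.

```python
# Sketch of a caption-augmentation pipeline in the spirit of the paper's
# data-quality findings: pair each image with its original alt-text plus
# synthetic captions, then sample among them at training time.
import random
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class TrainingExample:
    image_path: str
    alt_text: str
    synthetic_captions: List[str] = field(default_factory=list)

def augment_captions(examples: List[TrainingExample],
                     synthetic_captioner: Callable[[str], List[str]],
                     captions_per_image: int = 3) -> None:
    """Attach several machine-generated captions to every image."""
    for ex in examples:
        ex.synthetic_captions = synthetic_captioner(ex.image_path)[:captions_per_image]

def sample_caption(ex: TrainingExample, synthetic_ratio: float = 0.5) -> str:
    """At each training step, pick the original alt-text or a synthetic caption."""
    if ex.synthetic_captions and random.random() < synthetic_ratio:
        return random.choice(ex.synthetic_captions)
    return ex.alt_text
```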

Practical implications include potential developments in computationally efficient models for creative industries and high-fidelity visual content generation. These findings may also inspire new research directions within theoretical AI frameworks, influencing how future AI models balance parameter growth with computational constraints and data scaling.
