On the Scalability of Diffusion-based Text-to-Image Generation (2404.02883v1)
Abstract: Scaling up model and data size has been key to the evolution of LLMs, but the scaling laws for diffusion-based text-to-image (T2I) models remain underexplored. It is also unclear how to scale these models efficiently for better performance at reduced cost, and differing training settings and high training costs make fair model comparisons extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models through extensive and rigorous ablations of both denoising backbones and training sets, training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. On model scaling, we find that the location and amount of cross-attention distinguishes the performance of existing UNet designs, and that increasing the number of transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel counts. We then identify an efficient UNet variant that is 45% smaller and 28% faster than SDXL's UNet. On data scaling, we show that the quality and diversity of the training set matter more than sheer dataset size: increasing caption density and diversity improves both text-image alignment and learning efficiency. Finally, we provide scaling functions that predict text-image alignment performance as functions of model size, compute, and dataset size.
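The scaling functions mentioned at the end of the abstract are typically power laws fit to observed (scale, performance) pairs. As a minimal sketch of how such a fit works in practice, the snippet below recovers a power law `score = a * compute**b` from synthetic data; the constants and the variable names (`compute`, `score`) are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical illustration: fit a power-law scaling curve of the form
#   score = a * compute**b
# to synthetic (compute, alignment-score) points. The generating constants
# a=0.5, b=0.1 are made up for demonstration only.
compute = np.array([1e2, 1e3, 1e4, 1e5, 1e6])  # training compute (arbitrary units)
score = 0.5 * compute**0.1                     # synthetic alignment scores

# A power law is linear in log-log space:
#   log(score) = log(a) + b * log(compute)
# so an ordinary least-squares line fit recovers the exponent and prefactor.
b, log_a = np.polyfit(np.log(compute), np.log(score), 1)
a = np.exp(log_a)

print(a, b)  # recovers the generating constants (up to numerical precision)
```

Given a fitted curve of this form, one can extrapolate the alignment score expected at a larger compute budget before committing to a training run, which is the practical use of such scaling functions.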