Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models (2405.16759v1)
Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained on internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% vs. 21.4%.
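The core idea of growing a pre-trained model while "preserving the integrity of the pre-trained representation" can be illustrated with a toy sketch. This is a hypothetical, minimal numpy analogue (not the paper's actual architecture or code): a shallow stack of layers stands in for the pre-trained core, and growing wraps it with new outer layers initialized to the identity, so the grown model initially computes exactly what the core did.

```python
import numpy as np

class ShallowCore:
    """Toy stand-in for the pre-trained shallow core (no down/up-sampling).
    Purely illustrative; the real model is a large UNet."""
    def __init__(self, dim, depth, rng):
        self.weights = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(depth)]

    def __call__(self, x):
        for w in self.weights:
            x = np.tanh(x @ w)
        return x

def grow(core, dim):
    """Greedy growing, sketched: wrap the core with new outer layers
    initialized to the identity. At initialization the grown model's
    output equals the core's output, preserving its representation;
    the new layers are then trained (not shown here)."""
    w_in = np.eye(dim)   # new "encoder" layer, identity init (assumption)
    w_out = np.eye(dim)  # new "decoder" layer, identity init (assumption)

    def grown(x):
        return core(x @ w_in) @ w_out
    return grown

rng = np.random.default_rng(0)
dim = 8
core = ShallowCore(dim, depth=3, rng=rng)
grown = grow(core, dim)
x = rng.standard_normal((2, dim))
# At init, growing changes nothing: the pre-trained behavior is preserved.
assert np.allclose(grown(x), core(x))
```

The identity initialization is the point of the sketch: it shows one simple way a grown model can start from the exact function the core learned, which is what makes subsequent high-resolution training stable in spirit, though the paper's actual growing procedure operates on UNet blocks rather than dense layers.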