Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models (2405.16759v1)

Published 27 May 2024 in cs.CV and cs.LG

Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained on internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% to 21.4%.

Summary

  • The paper presents a novel Shallow-UViT architecture that decouples text-to-image alignment from high-resolution rendering to enhance model stability.
  • It introduces a greedy growing algorithm that gradually scales resolution while allowing training with smaller batch sizes and reduced computational load.
  • Empirical results show marked improvements in automated metrics such as FID and CLIP score, as well as in human evaluations, outperforming traditional cascaded methods.

Analyzing Shallow-UViT and Greedy Growing Strategies for High-Resolution Diffusion Models

The focus of this paper is the development and evaluation of large-scale pixel-space text-to-image diffusion models (PSDMs) for generating high-resolution images. Training these models is challenging because of optimization instabilities and the massive computational resources required, especially as model size and target resolution increase. Traditional approaches, such as cascaded models and latent diffusion models (LDMs), either chain multiple independently trained diffusion models or operate in a low-dimensional latent space. Cascades in particular can degrade image quality through the distribution shift between the ground-truth low-resolution images their super-resolution stages are trained on and the generated images they receive at inference, which especially affects the synthesis of small structures such as faces and hands.

Key Contributions

  1. Shallow-UViT Architecture:
    • The paper introduces "Shallow-UViT," a novel architecture for decoupling the training of 'visual concepts' from the image resolution at which these concepts are rendered.
    • Shallow-UViT allows pretraining of core layers on large datasets of text-image pairs, facilitating training at lower resolutions and thus addressing memory and computational resource barriers.
    • This approach separately focuses on text-to-image alignment and image generation at the final resolution, enhancing stability and performance (a minimal architecture sketch follows this list).
  2. Greedy Growing Algorithm:
    • A novel training procedure, described as a greedy algorithm, allows for gradual scaling of model resolution while retaining the stability of the pretrained core representation layers.
    • The algorithm separates the training phases for core components (text-to-image alignment) and resolution-specific components (high-resolution generation).
    • This method enables successful training of high-resolution models with smaller batch sizes, reducing resource requirements (a training-loop sketch follows this list).
  3. Empirical Scaling and Performance Analysis:
    • The paper provides scaling results for Shallow-UViT models and demonstrates significant improvements in standard image distribution metrics (FID, FD-Dino, CMMD) and in text-image alignment (CLIP score) as model size increases (a CLIP score computation is sketched after this list).
    • A systematic comparison of models trained from scratch, finetuned end-to-end, and trained with frozen core layers shows that freezing the pretrained representation yields better image quality and optimization stability in larger models.
  4. Vermeer: A High-Resolution Prototype:
    • The final part of the work showcases "Vermeer," a large-scale, non-cascaded text-to-image diffusion model trained with the proposed greedy growing algorithm, incorporating techniques like prompt preemption and style tuning.
    • Human evaluation studies reveal that Vermeer is preferred over prior models such as SDXL by a significant margin (44.0% vs. 21.4%) in both image quality and consistency with text prompts.
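
The Shallow-UViT idea is easiest to see in code. Below is a minimal PyTorch sketch, not the paper's exact architecture: the layer sizes, the patch size, and the concatenation-based text conditioning are illustrative assumptions. What it illustrates is structural: a single patchify step replaces the UNet's down-sampling encoder, all capacity sits in a deep core, and a single unpatchify step replaces the up-sampling decoder.

```python
import torch
import torch.nn as nn

class ShallowUViT(nn.Module):
    """Shallow encoder/decoder around a deep core: no resolution pyramid."""

    def __init__(self, img_channels=3, patch=8, dim=1024, depth=24, heads=16):
        super().__init__()
        # Shallow "encoder": one strided conv (patchify), no down-sampling stack.
        self.patchify = nn.Conv2d(img_channels, dim, kernel_size=patch, stride=patch)
        # Deep core: the part the paper scales, pretrains, and later freezes.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=depth)
        # Shallow "decoder": one transposed conv (unpatchify) back to pixels.
        self.unpatchify = nn.ConvTranspose2d(
            dim, img_channels, kernel_size=patch, stride=patch)

    def forward(self, x, text_tokens):
        # x: (B, C, H, W) noisy image; text_tokens: (B, T, dim) text embeddings.
        h = self.patchify(x)                          # (B, dim, H/p, W/p)
        b, d, gh, gw = h.shape
        h = h.flatten(2).transpose(1, 2)              # (B, N, dim) token sequence
        h = torch.cat([text_tokens, h], dim=1)        # joint self-attention (assumed)
        h = self.core(h)[:, text_tokens.shape[1]:]    # drop the text positions
        h = h.transpose(1, 2).reshape(b, d, gh, gw)
        return self.unpatchify(h)                     # denoising prediction
```

Because nothing in the forward pass hard-codes the token count, the same core can later be driven by new outer layers operating at a higher input resolution.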
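
The greedy growing loop itself can be sketched as follows, again as an assumption-laden illustration rather than the paper's implementation: `make_stage_blocks`, the toy linear noise schedule in `diffusion_loss`, and the hyperparameters are placeholders. The essential moves match the summary above: freeze everything trained so far, wrap it in new resolution-specific layers, and optimize only the new parameters, which is what permits the smaller batch sizes at high resolution.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module):
    """Protect the pretrained representation: no gradients flow into it."""
    for p in module.parameters():
        p.requires_grad_(False)

class GrownModel(nn.Module):
    """A pretrained core wrapped in new down/up blocks for a higher resolution."""

    def __init__(self, core, down, up):
        super().__init__()
        self.core, self.down, self.up = core, down, up

    def forward(self, x, cond):
        h = self.down(x)        # new layers: hi-res input -> core's resolution
        h = self.core(h, cond)  # frozen core: alignment, structure, composition
        return self.up(h)       # new layers: render back at high resolution

def make_stage_blocks(channels=3):
    # Hypothetical new stage: one 2x down-sampling and one 2x up-sampling conv.
    down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
    up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
    return down, up

def diffusion_loss(model, x, cond):
    # Standard denoising objective with a toy linear noise schedule (assumed).
    t = torch.rand(x.shape[0], 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * x + t * noise
    return ((model(x_t, cond) - noise) ** 2).mean()

def grow_one_stage(pretrained, batches, steps=10_000, lr=1e-4):
    """One greedy step: freeze what exists, add a stage, train only the new layers."""
    freeze(pretrained)
    model = GrownModel(pretrained, *make_stage_blocks())
    new_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(new_params, lr=lr)
    for _, (x, cond) in zip(range(steps), batches):
        loss = diffusion_loss(model, x, cond)
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # call again with higher-resolution batches to keep growing
```

With the `ShallowUViT` above as the initial core, calling `grow_one_stage` repeatedly on progressively higher-resolution data yields a single-stage high-resolution model; whether to eventually unfreeze and finetune end-to-end is precisely the trade-off the paper's frozen-vs-finetuned comparison examines.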
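
For the alignment metric, the following sketch shows how a CLIP score is commonly computed with the Hugging Face `transformers` CLIP implementation. The checkpoint choice and the 100x-cosine scaling are widespread conventions (the original CLIPScore paper uses 2.5 * max(cos, 0)), not details taken from this paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 100.0 * torch.clamp((img * txt).sum(dim=-1), min=0).item()
```

In practice this per-sample score is averaged over a fixed prompt set, so that models of different sizes can be compared on the same text-image alignment axis reported in the paper.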

Implications and Future Directions

The implications of this research are multifaceted:

  • Practical Applications: The methods developed allow for the high-fidelity generation of high-resolution images without the drawbacks of traditional cascaded approaches, making them highly applicable in areas requiring detailed synthetic imagery.
  • Resource Efficiency: The ability to train large-scale models with smaller batch sizes and reduced computational load makes these techniques more accessible and sustainable.
  • Downstream Tasks: The paper hints at the broader applicability of these models beyond image generation, suggesting potential improvements in solving inverse problems or other generative tasks where high-resolution models are beneficial.

Future Developments

  • Further Scaling: Extending the methodologies to train even larger models and achieving finer image resolution remains an area for future research.
  • Enhanced Core Design: Investigating more sophisticated designs for core components could further enhance the quality and stability of the models.
  • Real-World Data: Applying these methods to diverse, real-world datasets can push the boundaries of performance and generalizability.

In conclusion, this paper presents a structured approach to addressing the challenges of training high-resolution, large-scale text-to-image diffusion models by decoupling the learning phases for alignment and resolution, supported by innovative architecture and training algorithms. The demonstrated improvements in empirical metrics and human preference studies underscore the potential of these methods in advancing state-of-the-art generative AI.