Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models (2405.16759v1)

Published 27 May 2024 in cs.CV and cs.LG

Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full-pipeline model trained on internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL by 44.0% to 21.4%.

Summary

  • The paper presents a novel Shallow-UViT architecture that decouples text-to-image alignment from high-resolution rendering to enhance model stability.
  • It introduces a greedy growing algorithm that gradually scales resolution while allowing training with smaller batch sizes and reduced computational load.
  • Empirical results show marked improvements in automated metrics such as FID and CLIP score, as well as in human evaluations, outperforming traditional cascaded methods.

Analyzing Shallow-UViT and Greedy Growing Strategies for High-Resolution Diffusion Models

The focus of this paper is the development and evaluation of large-scale pixel-space text-to-image diffusion models (PSDMs) for generating high-resolution images. Training these models is challenging because of optimization instabilities and the massive computational resources required, especially as model size and target resolution increase. Traditional approaches, such as cascaded models and latent diffusion models (LDMs), either chain multiple independently trained diffusion models or operate in a low-dimensional latent space. Cascades in particular can degrade image quality through the distribution shift between the ground-truth low-resolution images their super-resolution stages are trained on and the generated images they receive at inference, which especially affects the synthesis of small structures such as faces and hands.

Key Contributions

  1. Shallow-UViT Architecture:
    • The paper introduces "Shallow-UViT," a novel architecture for decoupling the training of 'visual concepts' from the image resolution at which these concepts are rendered.
    • Shallow-UViT allows pretraining of core layers on large datasets of text-image pairs, facilitating training at lower resolutions and thus addressing memory and computational resource barriers.
    • This approach separately focuses on text-to-image alignment and image generation at the final resolution, enhancing stability and performance (a minimal architecture sketch follows this list).
  2. Greedy Growing Algorithm:
    • A novel training procedure, described as a greedy algorithm, allows for gradual scaling of model resolution while retaining the stability of the pretrained core representation layers.
    • The algorithm separates the training phases for core components (text-to-image alignment) and resolution-specific components (high-resolution generation).
    • This method enables successful training of high-resolution models with smaller batch sizes, reducing resource requirements (a training-loop sketch follows this list).
  3. Empirical Scaling and Performance Analysis:
    • The paper provides scaling results for Shallow-UViT models and demonstrates significant improvements in standard image distribution metrics (FID, FD-Dino, CMMD) and in text-image alignment (CLIP score) as model size increases (a CLIP score computation is sketched after this list).
    • A systematic comparison of models trained from scratch, finetuned end-to-end, and trained with frozen core layers shows that freezing the pretrained representation yields better image quality and optimization stability in larger models.
  4. Vermeer: A High-Resolution Prototype:
    • The final part of the work showcases "Vermeer," a large-scale, non-cascaded text-to-image diffusion model trained with the proposed greedy growing algorithm, incorporating techniques like prompt preemption and style tuning.
    • Human evaluation studies reveal that Vermeer is preferred over prior models such as SDXL by a significant margin (44.0% vs. 21.4%) in both image quality and consistency with text prompts.
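
The Shallow-UViT idea is easiest to see in code. Below is a minimal PyTorch sketch, not the paper's exact architecture: the layer sizes, the patch size, and the concatenation-based text conditioning are illustrative assumptions. What it illustrates is structural: a single patchify step replaces the UNet's down-sampling encoder, all capacity sits in a deep core, and a single unpatchify step replaces the up-sampling decoder.

```python
import torch
import torch.nn as nn

class ShallowUViT(nn.Module):
    """Shallow encoder/decoder around a deep core: no resolution pyramid."""

    def __init__(self, img_channels=3, patch=8, dim=1024, depth=24, heads=16):
        super().__init__()
        # Shallow "encoder": one strided conv (patchify), no down-sampling stack.
        self.patchify = nn.Conv2d(img_channels, dim, kernel_size=patch, stride=patch)
        # Deep core: the part the paper scales, pretrains, and later freezes.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=depth)
        # Shallow "decoder": one transposed conv (unpatchify) back to pixels.
        self.unpatchify = nn.ConvTranspose2d(
            dim, img_channels, kernel_size=patch, stride=patch)

    def forward(self, x, text_tokens):
        # x: (B, C, H, W) noisy image; text_tokens: (B, T, dim) text embeddings.
        h = self.patchify(x)                          # (B, dim, H/p, W/p)
        b, d, gh, gw = h.shape
        h = h.flatten(2).transpose(1, 2)              # (B, N, dim) token sequence
        h = torch.cat([text_tokens, h], dim=1)        # joint self-attention (assumed)
        h = self.core(h)[:, text_tokens.shape[1]:]    # drop the text positions
        h = h.transpose(1, 2).reshape(b, d, gh, gw)
        return self.unpatchify(h)                     # denoising prediction
```

Because nothing in the forward pass hard-codes the token count, the same core can later be driven by new outer layers operating at a higher input resolution.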
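
The greedy growing loop itself can be sketched as follows, again as an assumption-laden illustration rather than the paper's implementation: `make_stage_blocks`, the toy linear noise schedule in `diffusion_loss`, and the hyperparameters are placeholders. The essential moves match the summary above: freeze everything trained so far, wrap it in new resolution-specific layers, and optimize only the new parameters, which is what permits the smaller batch sizes at high resolution.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module):
    """Protect the pretrained representation: no gradients flow into it."""
    for p in module.parameters():
        p.requires_grad_(False)

class GrownModel(nn.Module):
    """A pretrained core wrapped in new down/up blocks for a higher resolution."""

    def __init__(self, core, down, up):
        super().__init__()
        self.core, self.down, self.up = core, down, up

    def forward(self, x, cond):
        h = self.down(x)        # new layers: hi-res input -> core's resolution
        h = self.core(h, cond)  # frozen core: alignment, structure, composition
        return self.up(h)       # new layers: render back at high resolution

def make_stage_blocks(channels=3):
    # Hypothetical new stage: one 2x down-sampling and one 2x up-sampling conv.
    down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
    up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
    return down, up

def diffusion_loss(model, x, cond):
    # Standard denoising objective with a toy linear noise schedule (assumed).
    t = torch.rand(x.shape[0], 1, 1, 1)
    noise = torch.randn_like(x)
    x_t = (1 - t) * x + t * noise
    return ((model(x_t, cond) - noise) ** 2).mean()

def grow_one_stage(pretrained, batches, steps=10_000, lr=1e-4):
    """One greedy step: freeze what exists, add a stage, train only the new layers."""
    freeze(pretrained)
    model = GrownModel(pretrained, *make_stage_blocks())
    new_params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(new_params, lr=lr)
    for _, (x, cond) in zip(range(steps), batches):
        loss = diffusion_loss(model, x, cond)
        opt.zero_grad(); loss.backward(); opt.step()
    return model  # call again with higher-resolution batches to keep growing
```

With the `ShallowUViT` above as the initial core, calling `grow_one_stage` repeatedly on progressively higher-resolution data yields a single-stage high-resolution model; whether to eventually unfreeze and finetune end-to-end is precisely the trade-off the paper's frozen-vs-finetuned comparison examines.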
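
For the alignment metric, the following sketch shows how a CLIP score is commonly computed with the Hugging Face `transformers` CLIP implementation. The checkpoint choice and the 100x-cosine scaling are widespread conventions (the original CLIPScore paper uses 2.5 * max(cos, 0)), not details taken from this paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 100.0 * torch.clamp((img * txt).sum(dim=-1), min=0).item()
```

In practice this per-sample score is averaged over a fixed prompt set, so that models of different sizes can be compared on the same text-image alignment axis reported in the paper.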

Implications and Future Directions

The implications of this research are multifaceted:

  • Practical Applications: The methods developed allow for the high-fidelity generation of high-resolution images without the drawbacks of traditional cascaded approaches, making them highly applicable in areas requiring detailed synthetic imagery.
  • Resource Efficiency: The ability to train large-scale models with smaller batch sizes and reduced computational load makes these techniques more accessible and sustainable.
  • Downstream Tasks: The paper hints at the broader applicability of these models beyond image generation, suggesting potential improvements in solving inverse problems or other generative tasks where high-resolution models are beneficial.

Future Developments

  • Further Scaling: Extending the methodologies to train even larger models and achieving finer image resolution remains an area for future research.
  • Enhanced Core Design: Investigating more sophisticated designs for core components could further enhance the quality and stability of the models.
  • Real-World Data: Applying these methods to diverse, real-world datasets can push the boundaries of performance and generalizability.

In conclusion, this paper presents a structured approach to addressing the challenges of training high-resolution, large-scale text-to-image diffusion models by decoupling the learning phases for alignment and resolution, supported by innovative architecture and training algorithms. The demonstrated improvements in empirical metrics and human preference studies underscore the potential of these methods in advancing state-of-the-art generative AI.