SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2307.01952v1)

Published 4 Jul 2023 in cs.CV and cs.AI

Abstract: We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Authors (8)
  1. Dustin Podell (3 papers)
  2. Zion English (4 papers)
  3. Kyle Lacey (3 papers)
  4. Andreas Blattmann (15 papers)
  5. Tim Dockhorn (13 papers)
  6. Jonas Müller (28 papers)
  7. Joe Penna (2 papers)
  8. Robin Rombach (24 papers)
Citations (1,370)

Summary

  • The paper introduces SDXL, a model that expands the UNet backbone and employs dual text encoders to significantly improve high-resolution image synthesis.
  • The paper details innovative conditioning techniques, including size and crop conditioning, which improve data utilization and reduce artifacts.
  • The paper emphasizes transparency by releasing its code and weights, fostering reproducibility and further advancements in generative modeling.

Introduction to SDXL

In the rapidly evolving landscape of text-to-image synthesis, SDXL is a substantial enhancement of the widely used Stable Diffusion framework. Its predecessors have already established themselves as foundational tools for a wide range of applications, from entertainment to scientific visualization. SDXL takes a leap forward with a UNet backbone roughly three times larger than that of earlier versions, achieved mainly through additional attention blocks and an enlarged cross-attention context enabled by a second text encoder. The model also introduces new conditioning techniques that require no additional supervision during training, as well as a separate refinement model that post-processes generated samples with an image-to-image pass to improve visual fidelity.
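
The two-stage base-plus-refiner workflow described above can be exercised with a few lines of code. The sketch below is a minimal illustration assuming the Hugging Face `diffusers` SDXL pipelines and the publicly released checkpoints; the model ids and call arguments come from that library, not from the paper itself.

```python
# Minimal sketch of the two-stage SDXL pipeline (base model + post-hoc refiner),
# assuming the Hugging Face `diffusers` library and the public SDXL checkpoints.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a close-up photograph of a red fox in fresh snow"

# Stage 1: the base model produces latents instead of a decoded image.
latents = base(prompt=prompt, output_type="latent").images

# Stage 2: the refiner applies an image-to-image pass in latent space to
# sharpen high-frequency detail before decoding to pixels.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("fox.png")
```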

Architectural Enrichments in SDXL

The architectural advances show up in several places. For efficiency, the model omits transformer blocks at the highest feature level and instead uses them heavily at lower resolutions, concentrating computation on lower-level features within the UNet. A salient departure from the original architecture is the combined text encoder: the outputs of CLIP ViT-L and OpenCLIP ViT-bigG are concatenated along the channel axis to form the cross-attention context. The pooled text embedding from OpenCLIP is used as an additional conditioning signal, and the resulting UNet has 2.6 billion parameters.
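
A shape-level sketch of this dual-encoder conditioning is shown below. The tensors are random placeholders standing in for real encoder outputs, and the hidden sizes (768 for CLIP ViT-L, 1280 for OpenCLIP ViT-bigG, 77 tokens per prompt) are assumptions based on the public model configurations rather than values quoted from the paper.

```python
# Shape-level sketch of combining two text encoders for cross-attention conditioning.
# Random tensors stand in for encoder outputs; dimensions are illustrative.
import torch

batch, tokens = 2, 77
clip_l_hidden = torch.randn(batch, tokens, 768)     # per-token features, CLIP ViT-L
openclip_hidden = torch.randn(batch, tokens, 1280)  # per-token features, OpenCLIP ViT-bigG
openclip_pooled = torch.randn(batch, 1280)          # pooled embedding, OpenCLIP ViT-bigG

# Per-token features are concatenated along the channel axis to form the
# cross-attention context fed to the UNet's attention blocks.
context = torch.cat([clip_l_hidden, openclip_hidden], dim=-1)  # (2, 77, 2048)

# The pooled embedding is projected and added to the timestep embedding,
# providing a global text signal alongside the per-token context.
time_embed_dim = 1280
pooled_proj = torch.nn.Linear(1280, time_embed_dim)
timestep_embedding = torch.randn(batch, time_embed_dim)
global_cond = timestep_embedding + pooled_proj(openclip_pooled)

print(context.shape, global_cond.shape)  # torch.Size([2, 77, 2048]) torch.Size([2, 1280])
```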

Innovations in Conditioning Techniques

SDXL introduces two simple but effective conditioning mechanisms. The first, size conditioning, provides the original spatial dimensions of each training image as a conditioning signal, so images below a preset resolution threshold no longer have to be discarded or upscaled, an issue that has historically handicapped latent diffusion models. This allows a more thorough use of the available data without sacrificing generalization. The second, crop conditioning, exposes the crop coordinates applied during training, which suppresses the artifacts that random cropping would otherwise introduce and lets inference favor aesthetically pleasing, object-centered compositions. Both signals are embedded and added to the timestep embedding, as illustrated in the sketch below. Furthermore, multi-aspect training prepares SDXL to handle multiple aspect ratios, a significant step toward generating images that match the real-world distribution of aspect ratios.
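
The following is a minimal sketch of that micro-conditioning, assuming sinusoidal (Fourier) feature embeddings of the same kind commonly used for the timestep; the dimensions, helper names, and projection layer are illustrative rather than taken from the released implementation.

```python
# Minimal sketch of size/crop micro-conditioning: each scalar (original height/width,
# crop top/left) is embedded with sinusoidal features and folded into the timestep
# embedding. Dimensions and names are illustrative.
import math
import torch

def fourier_embed(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Embed scalar conditioning values with sinusoidal features, one dim-sized
    vector per value (same construction as a standard timestep embedding)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = values[..., None].float() * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

batch = 2
# Original image size and applied crop offsets, one row per training example.
orig_size = torch.tensor([[512.0, 384.0], [1024.0, 1024.0]])  # (height, width)
crop_coords = torch.tensor([[0.0, 0.0], [64.0, 128.0]])       # (top, left)

# Embed each scalar, then flatten the per-example embeddings into one vector.
cond = torch.cat([orig_size, crop_coords], dim=-1)             # (2, 4)
cond_embed = fourier_embed(cond).reshape(batch, -1)            # (2, 4 * 256)

# Project to the timestep-embedding width and add, so the UNet receives the
# micro-conditioning alongside the noise-level information.
time_embed_dim = 1280
proj = torch.nn.Linear(cond_embed.shape[-1], time_embed_dim)
timestep_embedding = torch.randn(batch, time_embed_dim)
conditioned = timestep_embedding + proj(cond_embed)
print(conditioned.shape)  # torch.Size([2, 1280])
```

At inference time the same pathway lets a user ask for a target apparent resolution or an uncropped, centered composition simply by choosing the conditioning values.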

Unified Improvement and Transparency

SDXL goes beyond an incremental improvement by delivering a cohesive model that combines structural and conditioning advances. In contrast with the black-box approach characteristic of many state-of-the-art image generators, SDXL releases its code and model weights to the community, fostering open research and methodological transparency. This openness addresses concerns about reproducibility, innovation, and the assessment of biases in image generation models, while the model itself achieves superior performance and more visually compelling outputs than previous versions of Stable Diffusion.

The work marks a substantial advance in the text-to-image domain and points to several directions for further gains in model performance, architecture, and distillation to reduce computational cost. The transparency and openness of SDXL serve as a catalyst for ongoing research and may pave the way for future breakthroughs in generative modeling.
