SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis (2307.01952v1)
Abstract: We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: the increase in model parameters is mainly due to more attention blocks and a larger cross-attention context, as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
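The refinement stage mentioned in the abstract can be pictured as a noise-then-denoise pass over the base model's output. The following is a minimal sketch, assuming an SDEdit-style image-to-image step; the function `refine_latent`, the `toy_refiner` stand-in, the noise schedule, and the Euler update are all illustrative assumptions and not the released SDXL implementation.

```python
import random


def refine_latent(latent, denoiser, sigmas, strength, rng):
    """SDEdit-style refinement sketch: re-noise `latent`, then denoise it again.

    latent   -- flat list of floats (stand-in for a latent tensor)
    denoiser -- hypothetical callable (x, sigma) -> estimate of the clean latent
    sigmas   -- strictly decreasing noise levels, ending in 0.0
    strength -- fraction of the schedule to re-run, in [0, 1]
    rng      -- random.Random instance used to draw the added noise
    """
    n_steps = len(sigmas) - 1
    k = int(round(strength * n_steps))  # number of denoising steps to run
    if k == 0:
        return list(latent)  # nothing to refine, return a copy
    start = n_steps - k
    # Perturb the base sample to the intermediate noise level sigmas[start].
    x = [v + sigmas[start] * rng.gauss(0.0, 1.0) for v in latent]
    for i in range(start, n_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        denoised = denoiser(x, sigma)
        # Euler step from noise level sigma down to sigma_next.
        x = [xv + (sigma_next - sigma) * (xv - dv) / sigma
             for xv, dv in zip(x, denoised)]
    return x


def toy_refiner(x, sigma):
    """Toy stand-in for a refiner network: always predicts an all-zero latent."""
    return [0.0] * len(x)


rng = random.Random(0)
base_latent = [1.0] * 16  # pretend output of the base model
schedule = [10.0, 5.0, 2.0, 1.0, 0.5, 0.0]
refined = refine_latent(base_latent, toy_refiner, schedule, 0.4, rng)
unchanged = refine_latent(base_latent, toy_refiner, schedule, 0.0, rng)
```

In this sketch `strength` controls how much of the base sample is overwritten: a small value re-runs only the last few low-noise steps, so the overall composition is kept while fine detail is regenerated by the second model.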
Authors: Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach