Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss (2401.02677v1)
Abstract: Stable Diffusion XL (SDXL) has become the leading open-source text-to-image (T2I) model, known for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, obtained by progressively removing layers guided by layer-level losses, with the goal of reducing model size while preserving generative quality. We release the model weights at https://hf.co/Segmind. Our methodology eliminates residual networks and transformer blocks from the U-Net structure of SDXL, yielding significant reductions in parameter count and latency. The compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against the larger multi-billion-parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments.
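To make the layer-level distillation idea concrete, the snippet below is a minimal, illustrative PyTorch sketch rather than the authors' training code: `TinyUNetStub`, the loss weights, and the returned feature lists are hypothetical stand-ins. It shows how a layer-level (intermediate-feature) loss can be combined with an output-level distillation loss and the ordinary denoising objective when a pruned student is trained against the SDXL teacher.

```python
# Minimal sketch of layer-level knowledge distillation (assumed setup, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNetStub(nn.Module):
    """Stand-in for a (pruned) U-Net: returns the prediction and intermediate features."""
    def __init__(self, width: int = 64):
        super().__init__()
        self.block1 = nn.Conv2d(4, width, 3, padding=1)
        self.block2 = nn.Conv2d(width, 4, 3, padding=1)

    def forward(self, x):
        h = F.silu(self.block1(x))          # intermediate feature used for the layer-level loss
        out = self.block2(h)                # predicted noise
        return out, [h]

def distillation_loss(student, teacher, noisy_latents, target_noise,
                      w_task=1.0, w_out=1.0, w_feat=1.0):
    s_out, s_feats = student(noisy_latents)
    with torch.no_grad():                   # teacher is frozen
        t_out, t_feats = teacher(noisy_latents)

    loss_task = F.mse_loss(s_out, target_noise)   # ordinary denoising objective
    loss_out = F.mse_loss(s_out, t_out)           # output-level distillation
    # Layer-level loss: match remaining student blocks to the teacher's features.
    loss_feat = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
    return w_task * loss_task + w_out * loss_out + w_feat * loss_feat

# Toy usage: the student keeps the teacher's feature shapes so the layer losses align.
teacher = TinyUNetStub().eval()
student = TinyUNetStub()
latents = torch.randn(2, 4, 32, 32)
noise = torch.randn_like(latents)
print(distillation_loss(student, teacher, latents, noise))
```

In the actual setting, whole residual and transformer blocks are removed from the SDXL U-Net, so the surviving student blocks retain the teacher's feature dimensions and the per-layer losses can be applied directly.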
Authors: Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick von Platen