Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss (2401.02677v1)
Abstract: Stable Diffusion XL (SDXL) has become the leading open-source text-to-image (T2I) model, known for its versatility and top-notch image quality. Efficiently addressing the computational demands of SDXL is crucial for wider reach and applicability. In this work, we introduce two scaled-down variants, Segmind Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter UNets, respectively, obtained by progressively removing layers guided by layer-level losses, with the goal of reducing model size while preserving generative quality. We release the model weights at https://hf.co/Segmind. Our methodology eliminates residual networks and transformer blocks from the U-Net structure of SDXL, yielding significant reductions in parameter count and latency. The compact models effectively emulate the original SDXL by capitalizing on transferred knowledge, achieving competitive results against the larger multi-billion-parameter SDXL. Our work underscores the efficacy of knowledge distillation coupled with layer-level losses in reducing model size while preserving the high-quality generative capabilities of SDXL, thus facilitating more accessible deployment in resource-constrained environments.
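To make the layer-level distillation idea concrete, the snippet below is a minimal, illustrative PyTorch sketch rather than the authors' training code: `TinyUNetStub`, the loss weights, and the returned feature lists are hypothetical stand-ins. It shows how a layer-level (intermediate-feature) loss can be combined with an output-level distillation loss and the ordinary denoising objective when a pruned student is trained against the SDXL teacher.

```python
# Minimal sketch of layer-level knowledge distillation (assumed setup, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNetStub(nn.Module):
    """Stand-in for a (pruned) U-Net: returns the prediction and intermediate features."""
    def __init__(self, width: int = 64):
        super().__init__()
        self.block1 = nn.Conv2d(4, width, 3, padding=1)
        self.block2 = nn.Conv2d(width, 4, 3, padding=1)

    def forward(self, x):
        h = F.silu(self.block1(x))          # intermediate feature used for the layer-level loss
        out = self.block2(h)                # predicted noise
        return out, [h]

def distillation_loss(student, teacher, noisy_latents, target_noise,
                      w_task=1.0, w_out=1.0, w_feat=1.0):
    s_out, s_feats = student(noisy_latents)
    with torch.no_grad():                   # teacher is frozen
        t_out, t_feats = teacher(noisy_latents)

    loss_task = F.mse_loss(s_out, target_noise)   # ordinary denoising objective
    loss_out = F.mse_loss(s_out, t_out)           # output-level distillation
    # Layer-level loss: match remaining student blocks to the teacher's features.
    loss_feat = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
    return w_task * loss_task + w_out * loss_out + w_feat * loss_feat

# Toy usage: the student keeps the teacher's feature shapes so the layer losses align.
teacher = TinyUNetStub().eval()
student = TinyUNetStub()
latents = torch.randn(2, 4, 32, 32)
noise = torch.randn_like(latents)
print(distillation_loss(student, teacher, latents, noise))
```

In the actual setting, whole residual and transformer blocks are removed from the SDXL U-Net, so the surviving student blocks retain the teacher's feature dimensions and the per-layer losses can be applied directly.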
Authors: Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick von Platen