
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion (2305.15798v4)

Published 25 May 2023 in cs.LG

Abstract: Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves high computing demands due to billion-scale parameters. To enhance efficiency, recent studies have reduced sampling steps and applied network quantization while retaining the original architectures. The lack of architectural reduction attempts may stem from worries over expensive retraining for such massive models. In this work, we uncover the surprising potential of block pruning and feature distillation for low-cost general-purpose T2I. By removing several residual and attention blocks from the U-Net of SDMs, we achieve 30%~50% reduction in model size, MACs, and latency. We show that distillation retraining is effective even under limited resources: using only 13 A100 days and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and v2.1-base with over 6,000 A100 days). Benefiting from the transferred knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against larger multi-billion parameter models. We further demonstrate the applicability of our lightweight backbones in personalized generation and image-to-image translation. Deployment of our models on edge devices attains 4-second inference. Code and models can be found at: https://github.com/Nota-NetsPresso/BK-SDM

Authors (4)
  1. Bo-Kyeong Kim (9 papers)
  2. Hyoung-Kyu Song (9 papers)
  3. Thibault Castells (5 papers)
  4. Shinkook Choi (9 papers)
Citations (14)

Summary

Overview of BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

The paper presents BK-SDM, a novel approach to compressing Stable Diffusion Models (SDMs) used for text-to-image (T2I) generation. Stable Diffusion has gained significant traction due to its open-source availability and wide applicability in text-guided vision tasks. However, these models are computationally intensive, with billion-scale parameter counts and high inference latency. This work alleviates that overhead through architectural compression, specifically block pruning and feature distillation applied to the U-Net within SDMs.

Compression Methodology

The authors introduce an architectural compression scheme for SDMs that focuses on the U-Net, the core component responsible for denoising during image generation. The method comprises two stages:

  1. Block Pruning: The authors prune several residual (R) and attention (A) blocks from the U-Net to reduce both parameters and latency. The choice of blocks is informed by sensitivity analysis, which identifies the blocks that contribute least to generation quality and can be removed safely (a minimal pruning sketch follows this list). This yields BK-SDM-Base, BK-SDM-Small, and BK-SDM-Tiny, with up to a 51% reduction in size and a 43% improvement in latency.
  2. Knowledge Distillation (KD): To retrain the pruned model effectively, the authors combine output-level and feature-level knowledge distillation from the original, uncompressed model, letting the smaller student approximate its larger teacher (a sketch of the combined loss also follows this list). Remarkably, this retraining requires only 13 A100 GPU days, compared to the thousands needed to train standard SDMs.
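Below is a minimal sketch of the pruning step, not the authors' released code: it assumes a U-Net whose stages are `nn.ModuleList`s of blocks with matching channel widths, so that removing a block leaves a well-formed network; the function name, stage names, and indices are all illustrative.

```python
# Hypothetical block-pruning helper: drop the least-sensitive blocks per stage.
import copy
import torch.nn as nn

def prune_unet_stages(unet: nn.Module, drop: dict[str, list[int]]) -> nn.Module:
    """Return a copy of `unet` with the listed block indices removed per stage.

    `drop` maps a stage attribute name (e.g. "down_blocks") to the indices of
    blocks to delete. The indices would be chosen offline via sensitivity
    analysis: skip a block, measure the quality drop, and prune the blocks
    whose removal hurts least.
    """
    pruned = copy.deepcopy(unet)
    for stage_name, indices in drop.items():
        stage = getattr(pruned, stage_name)
        kept = nn.ModuleList(m for i, m in enumerate(stage) if i not in set(indices))
        setattr(pruned, stage_name, kept)
    return pruned

# Illustrative call: remove one block from a down stage and one from an up stage.
# student_unet = prune_unet_stages(teacher_unet, {"down_blocks": [3], "up_blocks": [0]})
```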
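The retraining objective combines the ordinary denoising loss with output-level and feature-level distillation terms. The sketch below shows that combination in PyTorch; the tensor names, feature pairing, and loss weights (`lambda_out`, `lambda_feat`) are assumptions for illustration, not values from the paper.

```python
# Hypothetical distillation loss: task loss + output-level KD + feature-level KD.
import torch.nn.functional as F

def bk_sdm_loss(student_eps, teacher_eps, true_eps,
                student_feats, teacher_feats,
                lambda_out=1.0, lambda_feat=1.0):
    # Standard denoising loss against the ground-truth noise.
    l_task = F.mse_loss(student_eps, true_eps)
    # Output-level KD: match the teacher's predicted noise.
    l_out = F.mse_loss(student_eps, teacher_eps.detach())
    # Feature-level KD: match intermediate U-Net activations, block by block.
    l_feat = sum(F.mse_loss(s, t.detach())
                 for s, t in zip(student_feats, teacher_feats))
    return l_task + lambda_out * l_out + lambda_feat * l_feat
```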

Experimental Results

The models were tested on several benchmarks, demonstrating that BK-SDMs achieve competitive results with significantly reduced computational resources. Despite being retrained on only 0.22 million image-text pairs (a small fraction of the data used for standard SDMs), BK-SDMs attain strong zero-shot performance on the MS-COCO dataset. The smaller models also inherit the visual style of the larger SDMs, producing photorealistic outputs with only a modest loss of fidelity.

Implications and Future Directions

The practical implications of this research are significant, especially for deploying T2I models on resource-constrained edge devices. BK-SDM models open avenues for using diffusion models in low-latency, budget-constrained scenarios, such as mobile applications, where inference time is critical. For example, deployment on the NVIDIA Jetson AGX Orin and iPhone 14 achieved inference times of roughly 4 seconds per image, a substantial decrease from standard configurations.

Theoretical implications include the potential to extend this compression technique to other diffusion-based generative models, further broadening the utility and accessibility of these architectures. The approach signifies a step towards more economical training and deployment of AI models, aligning with ongoing industry efforts to democratize AI technologies.

Future Work

Future research could combine these compression methods with step-reducing algorithms or fast diffusion solvers to further improve the efficiency of BK-SDMs. Extending the framework to other models with intricate architectures, and integrating it with quantization techniques, also holds promise for additional gains.
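As a concrete illustration of pairing the compressed backbone with a fast solver, the sketch below loads one of the released checkpoints through `diffusers` and swaps in a multistep DPM solver. The model id follows the authors' repository; the scheduler choice, step count, and prompt are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load the compressed BK-SDM backbone (checkpoint id from the authors' repo).
pipe = StableDiffusionPipeline.from_pretrained(
    "nota-ai/bk-sdm-small", torch_dtype=torch.float16
).to("cuda")
# Swap in a fast ODE solver so that few sampling steps suffice.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
image = pipe("a photo of a corgi surfing", num_inference_steps=20).images[0]
image.save("corgi.png")
```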

In summary, BK-SDM presents a compelling case for architectural compression in large-scale models, merging sophistication with practicality to push the boundaries of efficient generative AI deployment and research.
