Effective Quantization for Diffusion Models on CPUs (2311.16133v2)
Abstract: Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, their substantial computational requirements remain a notable challenge, leading to time-consuming generation. Quantization, a technique for compressing deep learning models to improve efficiency, is difficult to apply to diffusion models: they are notably more sensitive to quantization than other model types, which can result in degraded image quality. In this paper, we introduce a novel approach to quantizing diffusion models that leverages both quantization-aware training and distillation. Our results show that the quantized models maintain high image quality while delivering efficient inference on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.
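To make the approach concrete, below is a minimal sketch of quantization-aware training combined with distillation, using PyTorch's eager-mode quantization APIs. The `TinyDenoiser` module, the random toy data, and the 0.5/0.5 loss weighting are illustrative assumptions for demonstration, not the authors' actual architecture, training data, or hyperparameters.

```python
# Illustrative sketch only: QAT + distillation for a toy stand-in denoiser,
# not the paper's implementation.
import copy
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyDenoiser(nn.Module):
    """Minimal stand-in for a diffusion denoising network (e.g. a U-Net)."""
    def __init__(self, dim=64):
        super().__init__()
        self.quant = tq.QuantStub()       # entry into the INT8 region
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.dequant = tq.DeQuantStub()   # exit from the INT8 region

    def forward(self, x):
        return self.dequant(self.net(self.quant(x)))

teacher = TinyDenoiser().eval()                          # full-precision teacher
student = copy.deepcopy(teacher).train()                 # student to be quantized
student.qconfig = tq.get_default_qat_qconfig("fbgemm")   # INT8 config for x86 CPUs
tq.prepare_qat(student, inplace=True)                    # insert fake-quant observers

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
for _ in range(100):                                     # toy training loop
    x = torch.randn(8, 64)                               # stands in for noisy latents
    target = torch.randn(8, 64)                          # stands in for the true noise
    with torch.no_grad():
        teacher_pred = teacher(x)
    student_pred = student(x)
    # Denoising task loss plus a distillation loss pulling the quantized
    # student toward the full-precision teacher's predictions.
    loss = 0.5 * nn.functional.mse_loss(student_pred, target) + \
           0.5 * nn.functional.mse_loss(student_pred, teacher_pred)
    opt.zero_grad()
    loss.backward()
    opt.step()

student.eval()
int8_model = tq.convert(student)   # real INT8 kernels for CPU inference
```

In practice, the teacher would be the pretrained full-precision diffusion model and the losses would be computed on noise predictions across sampled timesteps; the fake-quantization observers let the student learn weights that remain accurate after conversion to genuine INT8 operators.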
- Hanwen Chang
- Haihao Shen
- Yiyang Cai
- Xinyu Ye
- Zhenzhong Xu
- Wenhua Cheng
- Kaokao Lv
- Weiwei Zhang
- Yintong Lu
- Heng Guo