
Effective Quantization for Diffusion Models on CPUs (2311.16133v2)

Published 2 Nov 2023 in cs.CV and cs.AI

Abstract: Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, the substantial need for computational resources continues to present a noteworthy challenge, contributing to time-consuming processes. Quantization, a technique employed to compress deep learning models for enhanced efficiency, presents challenges when applied to diffusion models. These models are notably more sensitive to quantization compared to other model types, potentially resulting in a degradation of image quality. In this paper, we introduce a novel approach to quantize diffusion models by leveraging both quantization-aware training and distillation. Our results show the quantized models can maintain high image quality while demonstrating inference efficiency on CPUs. The code is publicly available at: https://github.com/intel/intel-extension-for-transformers.

Effective Quantization for Diffusion Models on CPUs

The paper "Effective Quantization for Diffusion Models on CPUs" addresses significant computational resource demands in the application of diffusion models for image generation. By introducing a novel quantization approach, the authors seek to maintain high image quality while improving inference efficiency, particularly on CPUs. This research explores the application of quantization-aware training and distillation to diffusion models, focusing on both theoretical contributions and practical implementations.

Key Contributions

The paper presents three primary contributions:

  1. Precision Strategies for Diffusion Models: The authors design precision strategies tailored to diffusion models, optimizing quantization processes to improve computational efficiency without degrading image quality. These strategies are crucial when applying quantization to models like Stable Diffusion.
  2. Efficient Inference Runtime: A new inference runtime with high-performance kernels is developed specifically for CPU-based computation, substantially reducing image-generation time. The method generates 512×512-pixel images in under 6 seconds on Intel CPUs (a baseline timing sketch follows this list).
  3. Validation Across Model Versions: Validation is performed on various versions of Stable Diffusion (1.4, 1.5, and 2.1), consistently demonstrating the efficacy of the proposed approach.
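
For context, a full-precision CPU baseline can be timed with the Hugging Face diffusers library, as in the minimal sketch below. The paper's sub-6-second results come from its own quantized models and runtime kernels (released in intel-extension-for-transformers), which this snippet does not reproduce; the model identifier, prompt, and step count are illustrative assumptions.

```python
import time
from diffusers import StableDiffusionPipeline

# Load a full-precision Stable Diffusion pipeline on CPU; the paper's
# quantized UNet and custom kernels would replace these defaults.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cpu")

prompt = "a photo of an astronaut riding a horse"
start = time.perf_counter()
image = pipe(prompt, height=512, width=512, num_inference_steps=50).images[0]
print(f"512x512 generation took {time.perf_counter() - start:.1f} s on CPU")
image.save("sample.png")
```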

Methodological Details

The quantization of diffusion models is realized through a time-dependent mixed-precision scheme and quantization-aware training applied specifically to the UNet architecture. The paper elaborates on selective precision: the initial and final steps of the denoising process use higher-precision formats such as BFloat16, while intermediate steps adopt lower precision such as INT8. This balances computational efficiency and model accuracy.
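
A minimal sketch of such a time-dependent precision schedule is shown below; the boundary width of five steps and the string-based dispatch are illustrative assumptions rather than the paper's tuned configuration.

```python
def unet_precision_for_step(step: int, total_steps: int,
                            boundary: int = 5) -> str:
    # Early and late denoising steps are the most quantization-sensitive,
    # so they run the BFloat16 UNet; intermediate steps use the INT8 UNet.
    if step < boundary or step >= total_steps - boundary:
        return "bf16"
    return "int8"

# Example over a 50-step schedule: bf16 at both ends, int8 in the middle.
schedule = [unet_precision_for_step(t, 50) for t in range(50)]
print(schedule[:7], "...", schedule[-7:])
```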

To further enhance performance, the authors optimize the data layout for operations like GroupNorm and introduce a memory allocator that improves Multi-Head Attention execution. This restructuring significantly improves CPU utilization.
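
The paper's kernels are custom, but the layout idea can be illustrated in plain PyTorch: converting activations to channels-last (NHWC) keeps each pixel's channels contiguous in memory, which tends to suit GroupNorm's per-group reductions on CPUs. The tensor sizes below are illustrative, and this snippet shows only the general idea, not the paper's implementation.

```python
import torch

# GroupNorm normalizes over groups of channels at each spatial location;
# a channels-last (NHWC) layout keeps those channels contiguous in memory.
gn = torch.nn.GroupNorm(num_groups=32, num_channels=320)
x = torch.randn(1, 320, 64, 64)                   # default NCHW layout
x_nhwc = x.to(memory_format=torch.channels_last)  # same data, NHWC strides

y = gn(x_nhwc)
print(y.shape)  # torch.Size([1, 320, 64, 64])
```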

Experimental Findings

The experimental results demonstrate strong performance in maintaining image quality, as measured by Fréchet Inception Distance (FID) scores. The mixed-precision models achieve scores close to those of the full-precision models, indicating minimal accuracy loss. Performance benchmarks show substantial reductions in latency, with mixed-precision models generating images significantly faster than full precision.
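
FID can be reproduced with a standard implementation such as the one in torchmetrics. The sketch below uses random tensors and a small batch purely to show the API; a real evaluation compares thousands of generated images against a reference set (prompts from a captioned dataset such as MS-COCO are a common choice).

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated
# images; lower is better. Inputs are uint8 tensors of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

# Random stand-ins; small batches give unstable FID estimates in practice.
real_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (32, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")
```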

Implications and Future Work

This research provides an advanced understanding of applying quantization techniques to complex models like diffusion models, enabling high-quality image generation with reduced computational demand. In practice, these findings can lead to more accessible image generation capabilities on hardware with limited processing power.

Future work includes exploring more aggressive compression techniques, such as 4-bit quantization and sparsity, to further enhance efficiency. Additionally, the potential of early exits in model inference could be investigated to optimize performance further.

In conclusion, this paper offers a methodological advance in model quantization, providing both a foundation for future research and immediate practical benefits for computationally efficient image generation.

Authors (10)
  1. Hanwen Chang
  2. Haihao Shen
  3. Yiyang Cai
  4. Xinyu Ye
  5. Zhenzhong Xu
  6. Wenhua Cheng
  7. Kaokao Lv
  8. Weiwei Zhang
  9. Yintong Lu
  10. Heng Guo