BitsFusion: 1.99 Bits Weight Quantization of Diffusion Model
Overview
The paper, "BitsFusion: 1.99 bits Weight Quantization of Diffusion Model," explores a method to significantly reduce the size of large-scale diffusion models while maintaining or even improving their performance. The authors introduce BitsFusion, a novel weight quantization method that compresses the UNet from Stable Diffusion v1.5 to 1.99 bits per weight. This results in a model size that is 7.9 times smaller than its full-precision counterpart without sacrificing, and in some cases, even enhancing, the image generation quality. Through rigorous evaluations, the paper demonstrates the efficacy of BitsFusion across various benchmarks and via human assessment.
Strong Numerical Results and Claims
The most striking numerical result in the paper is the 7.9 times reduction in model size achieved by BitsFusion: the UNet of Stable Diffusion v1.5 shrinks from 1.72 GB in half-precision (FP16) to just 219 MB. Furthermore, this significant compression is accompanied by an improvement in generation quality, as evidenced by higher CLIP scores and better text-image alignment. These results are supported by thorough evaluations on TIFA, GenEval, and the MS-COCO validation set, alongside human evaluations on PartiPrompts. In all these evaluations, the quantized model consistently outperformed the original in terms of image quality and text-image alignment.
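As a rough back-of-envelope check (not a calculation from the paper), the reported numbers are mutually consistent: moving from 16-bit to roughly 2-bit weights predicts a compression factor of about 16 / 1.99 ≈ 8.0, and the reported sizes give 1.72 GB / 219 MB ≈ 1720 MB / 219 MB ≈ 7.9. The small gap from the ideal 8.0× is plausibly accounted for by quantization metadata (e.g., scaling factors) and the layers kept at higher precision.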
Methodology
Mixed-Precision Quantization
The authors introduce a mixed-precision quantization strategy for diffusion models. Their approach involves an in-depth, per-layer analysis of quantization error, evaluated with two primary metrics: mean squared error (MSE) and CLIP score degradation. By weighing each layer's quantization error against its parameter count, the authors derive a sensitivity score that guides bit-width allocation, balancing error minimization against storage efficiency. Highly sensitive layers are allocated more bits, while less critical layers, particularly large ones that dominate the storage budget, are pushed to extremely low-bit representations, yielding an average of 1.99 bits per weight. A hypothetical sketch of such a scoring scheme is shown below.
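The following minimal Python sketch illustrates how a sensitivity-guided bit assignment of this kind could look. The simple uniform quantizer, the weighted combination of MSE and CLIP-score degradation, the `alpha`/`beta` weights, the candidate bit widths, and the threshold are all illustrative assumptions; the paper's exact scoring formula and allocation procedure differ in detail.

```python
import numpy as np

def quantize_uniform(w: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric per-row quantization of a 2D weight matrix (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)                 # avoid division by zero
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def sensitivity_score(w: np.ndarray, bits: int, clip_drop: float,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Hypothetical sensitivity: quantization error weighted against layer size.

    `clip_drop` is the measured CLIP-score degradation when only this layer is
    quantized to `bits`; the (alpha, beta) weighting and the division by the
    parameter count are assumptions, not the paper's formula.
    """
    mse = float(np.mean((w - quantize_uniform(w, bits)) ** 2))
    return (alpha * mse + beta * clip_drop) / w.size          # penalize large layers

def assign_bits(layers: dict, clip_drops: dict, candidate_bits=(2, 3, 4),
                threshold: float = 1e-9) -> dict:
    """Give each layer the smallest bit width whose sensitivity is acceptable."""
    plan = {}
    for name, w in layers.items():
        plan[name] = candidate_bits[-1]                       # fallback: highest width
        for b in candidate_bits:
            if sensitivity_score(w, b, clip_drops[name][b]) < threshold:
                plan[name] = b
                break
    return plan
```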
Innovations in Initialization and Training
To address the challenges of extremely low-bit quantization, several novel techniques were introduced:
- Time Embedding Pre-computing and Caching: By pre-computing and caching the time embeddings and their projections, the authors bypass the need to quantize these sensitive layers and achieve significant storage reductions (a minimal caching sketch follows this list).
- Adding a Balance Integer: An extra integer level is added so that the quantization grid is symmetric around zero, which is crucial for preserving performance at very low bit widths.
- Scaling Factor Initialization via Alternating Optimization: The scaling factors are refined by alternating between rounding the weights to integer levels and re-solving for the best scaling factor, leading to better initial quantized weight representations (see the second sketch below).
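For the time-embedding item, here is a minimal caching sketch. It assumes a diffusers-style UNet exposing `time_proj` (sinusoidal projection) and `time_embedding` (MLP) modules; the attribute names and shapes are assumptions for illustration, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def precompute_time_embeddings(time_proj, time_embedding, num_train_timesteps=1000):
    """Pre-compute time-embedding outputs for every discrete time step.

    Because the sampler only ever queries a fixed set of integer time steps,
    the outputs can be cached once and the time-embedding layers dropped from
    the quantized model entirely.
    """
    timesteps = torch.arange(num_train_timesteps)
    cache = time_embedding(time_proj(timesteps))   # [num_steps, emb_dim]
    return cache

# At inference, a cached row replaces the time-embedding forward pass:
# temb = cache[t]
```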
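For the last two items, the sketch below combines a symmetric ("balance-integer") grid with alternating optimization of the scaling factor. Interpreting the balance integer as extending the usual signed range [-2^(b-1), 2^(b-1)-1] to the symmetric range [-2^(b-1), 2^(b-1)], and using a single per-tensor scale, are simplifying assumptions; BitsFusion's construction is more detailed.

```python
import numpy as np

def symmetric_levels(bits: int) -> np.ndarray:
    """Integer grid with an added 'balance' level so it is symmetric around 0."""
    k = 2 ** (bits - 1)
    return np.arange(-k, k + 1)

def init_scale_alternating(w: np.ndarray, bits: int, iters: int = 20):
    """Alternate between snapping w/scale to the nearest allowed integer and
    solving the closed-form least-squares scale for those integers."""
    levels = symmetric_levels(bits)
    scale = np.abs(w).max() / levels.max()          # naive starting point
    q = np.zeros_like(w)
    for _ in range(iters):
        # (a) fix scale, pick the nearest integer level for every weight
        q = levels[np.argmin(np.abs(w[..., None] / scale - levels), axis=-1)]
        # (b) fix integers, least-squares scale: argmin_s ||w - s*q||^2
        denom = float(np.sum(q * q))
        if denom == 0:
            break
        scale = float(np.sum(w * q)) / denom
    return scale, q

# Example on random weights (illustration only)
w = np.random.randn(64, 64).astype(np.float32) * 0.05
scale, q = init_scale_alternating(w, bits=2)
print("reconstruction MSE:", np.mean((w - scale * q) ** 2))
```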
Two-Stage Training Pipeline
The training pipeline was designed to minimize the quantization error while preserving, and potentially enhancing, the model's generative capabilities:
- Stage-I: Quantization-aware distillation, in which the quantized model learns from the full-precision teacher through both noise-prediction and feature-level losses.
- Stage-II: Fine-tuning of the quantized model with adjusted time-step sampling, so that the time steps most affected by quantization error receive more training signal (a sketch of both stages follows this list).
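A compact PyTorch-style sketch of both stages is given below. The loss weights, the proportional time-step sampling rule, and the function signatures are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(student_eps, teacher_eps, true_noise,
                        student_feats, teacher_feats,
                        lam_out=1.0, lam_feat=0.1):
    """Stage-I: quantization-aware distillation loss (weights are placeholders).

    Combines the standard noise-prediction loss with output-level and
    feature-level distillation from the full-precision teacher.
    """
    loss = F.mse_loss(student_eps, true_noise)                     # denoising loss
    loss = loss + lam_out * F.mse_loss(student_eps, teacher_eps)   # output distillation
    for fs, ft in zip(student_feats, teacher_feats):               # feature distillation
        loss = loss + lam_feat * F.mse_loss(fs, ft)
    return loss

def stage2_sample_timesteps(step_error: torch.Tensor, batch_size: int):
    """Stage-II: sample time steps in proportion to their measured quantization
    error, so the most degraded steps are fine-tuned more often (one plausible
    realization of 'adjusted time step sampling')."""
    probs = step_error / step_error.sum()
    return torch.multinomial(probs, batch_size, replacement=True)

# Usage sketch (shapes arbitrary):
# t = stage2_sample_timesteps(per_step_error, batch_size=8)
# loss = stage1_distill_loss(s_eps, t_eps, noise, s_feats, t_feats)
```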
Implications and Future Directions
Practical Implications
The practical implications of BitsFusion are substantial. The method greatly reduces the storage and memory requirements of diffusion models, making them more viable for deployment on resource-constrained devices such as mobile phones and embedded systems. The ability to generate high-quality images from a model a fraction of the original size opens new avenues for on-device applications in image synthesis, content creation, and augmented reality.
Theoretical Implications
BitsFusion introduces a comprehensive framework that combines mixed-precision quantization with advanced initialization and training techniques. This approach could generalize to other large-scale models, suggesting new research directions in model compression. Future developments might extend these quantization techniques to other components of the diffusion pipeline, such as the VAE and the text encoder, potentially leading to even more compact architectures.
Speculation on Future Developments
With the demonstrated success of BitsFusion, future research may focus on optimizing inference algorithms to further leverage the benefits of quantization. Additionally, investigating the model's robustness to varied input distributions and exploring adaptive quantization strategies could enhance the versatility and performance of quantized models. As AI continues to permeate more aspects of technology, the principles outlined in this paper are likely to guide efforts in making models more efficient and widely applicable.
In summary, the "BitsFusion" paper offers a sophisticated and methodical approach to reducing the size of diffusion models while maintaining, and sometimes enhancing, their performance. It sets a significant precedent for future work in model quantization and efficient AI deployment.