MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization (2405.17873v2)
Abstract: Diffusion models have achieved significant visual generation quality. However, their substantial computational and memory costs pose challenges for deployment on resource-constrained mobile devices and even desktop GPUs. Recent few-step diffusion models reduce inference time by reducing the number of denoising steps, but their memory consumption remains excessive. Post-Training Quantization (PTQ), which replaces high-bit-width floating-point representations with low-bit integer values (INT4/INT8), is an effective and efficient technique for reducing memory cost. However, when applied to few-step diffusion models, existing quantization methods struggle to preserve both image quality and text alignment. To address this issue, we propose a mixed-precision quantization framework, MixDQ. First, we design a specialized BOS-aware quantization method for the highly sensitive text embedding. Then, we conduct a metric-decoupled sensitivity analysis to measure the sensitivity of each layer. Finally, we develop an integer-programming-based method for bit-width allocation. Whereas existing quantization methods fall short even at W8A8, MixDQ achieves W8A8 without performance loss and W4A8 with negligible visual degradation. Compared with FP16, we achieve a 3-4x reduction in model size and memory cost, and a 1.45x latency speedup.
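The abstract only sketches the final bit-width allocation step. The following is a minimal illustrative sketch, assuming per-layer sensitivity scores have already been measured (e.g., by the metric-decoupled analysis): it frames allocation as a small integer program, where each layer picks one weight bit-width, the objective sums per-layer sensitivity, and a memory budget bounds the total weight size. The layer names, parameter counts, sensitivity values, memory budget, and the use of Google OR-Tools are hypothetical placeholders; the paper's exact objective and constraints may differ.

```python
# Hypothetical sketch: per-layer bit-width allocation as an integer program.
# All layer names, parameter counts, and sensitivity numbers below are made up;
# in practice they would come from the model and the sensitivity analysis.
from ortools.linear_solver import pywraplp

layers = ["conv_in", "attn_1", "ff_1", "conv_out"]            # hypothetical layer names
params = {"conv_in": 1.2e6, "attn_1": 4.0e6,                  # weight counts per layer
          "ff_1": 8.0e6, "conv_out": 1.5e6}
bit_choices = [4, 8]                                          # candidate weight bit-widths
# sensitivity[layer][bits]: measured quality drop if this layer's weights use `bits`
sensitivity = {
    "conv_in":  {4: 0.90, 8: 0.10},
    "attn_1":   {4: 0.50, 8: 0.05},
    "ff_1":     {4: 0.20, 8: 0.02},
    "conv_out": {4: 0.70, 8: 0.08},
}
# Memory budget: e.g. 60% of the size of a uniform W8 model (in bits).
budget_bits = 0.6 * sum(8 * params[l] for l in layers)

solver = pywraplp.Solver.CreateSolver("SCIP")
# x[l, b] == 1  <=>  layer l is quantized to b-bit weights
x = {(l, b): solver.BoolVar(f"x_{l}_{b}") for l in layers for b in bit_choices}

# Each layer must pick exactly one bit-width.
for l in layers:
    solver.Add(sum(x[l, b] for b in bit_choices) == 1)

# Total weight memory must stay within the budget.
solver.Add(sum(x[l, b] * b * params[l]
               for l in layers for b in bit_choices) <= budget_bits)

# Minimize the summed sensitivity of the chosen configuration.
solver.Minimize(sum(x[l, b] * sensitivity[l][b]
                    for l in layers for b in bit_choices))

if solver.Solve() == pywraplp.Solver.OPTIMAL:
    for l in layers:
        chosen = next(b for b in bit_choices if x[l, b].solution_value() > 0.5)
        print(f"{l}: W{chosen}")
```

A solver-based formulation of this kind enforces the memory budget globally rather than greedily per layer, which is one plausible way to trade off W4 and W8 assignments across layers of very different sensitivity.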
- Tianchen Zhao
- Xuefei Ning
- Tongcheng Fang
- Enshu Liu
- Guyue Huang
- Zinan Lin
- Shengen Yan
- Guohao Dai
- Yu Wang