SqueezeLLM: Dense-and-Sparse Quantization (2306.07629v4)

Published 13 Jun 2023 in cs.CL and cs.LG

Abstract: Generative LLMs have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is available at https://github.com/SqueezeAILab/SqueezeLLM.

Overview of "SqueezeLLM: Dense-and-Sparse Quantization"

The paper "SqueezeLLM: Dense-and-Sparse Quantization" addresses the significant challenge of deploying Generative LLMs for inference, given their extensive resource requirements. This challenge has commonly necessitated the use of multi-GPU inference pipelines, which are not only complex but also costly. Alternative solutions, such as using smaller and inherently less performant models, do not meet the rigorous demands of real-world applications. The paper proposes SqueezeLLM, a novel post-training quantization framework that effectively reduces the memory size of LLMs while largely maintaining model performance.

Core Contributions

The SqueezeLLM framework introduces two main innovations aimed at improving LLM quantization in the face of the 'Memory Wall': for single-batch generative inference, memory bandwidth rather than compute is the critical bottleneck, so reducing weight precision directly reduces the dominant cost. The two ideas are:

  1. Sensitivity-Based Non-Uniform Quantization: Instead of spacing quantization levels uniformly, SqueezeLLM places them non-uniformly using sensitivity information derived from second-order (approximate Hessian) statistics. The quantization values are obtained via sensitivity-weighted k-means clustering of the weights, so parameters whose perturbation most affects the model output are represented with higher fidelity. This yields a substantial perplexity improvement over uniform quantization at the same bit width.
  2. Dense-and-Sparse Decomposition: Each weight matrix is split into a dense component and a small sparse component that keeps outlier and highly sensitive values at full precision in an efficient sparse format. Removing these extreme values narrows the range the dense component must cover, so it can be quantized far more aggressively with little loss in model quality, which matters most at very low bit widths. A minimal sketch of both ideas is given after this list.
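
To make the two ideas concrete, here is a minimal, illustrative sketch in NumPy/scikit-learn rather than the authors' released implementation and CUDA kernels. The function names, the 0.5% magnitude-based outlier threshold, and the random sensitivity placeholder are assumptions for illustration; in the paper, sensitivities come from second-order (approximate Hessian) information computed on a small calibration set, and a small fraction of highly sensitive weights is also moved into the sparse component.

```python
# Illustrative sketch of SqueezeLLM's two core ideas (not the official code).
import numpy as np
from sklearn.cluster import KMeans


def dense_and_sparse_split(W, outlier_frac=0.005):
    """Split W into a dense part plus a sparse part holding the
    largest-magnitude outliers, which stay in full precision."""
    thresh = np.quantile(np.abs(W), 1.0 - outlier_frac)
    mask = np.abs(W) > thresh
    sparse = np.where(mask, W, 0.0)   # stored as CSR/FP16 in practice
    dense = np.where(mask, 0.0, W)    # remaining values to be quantized
    return dense, sparse


def sensitivity_weighted_quantize(dense, sensitivity, bits=3):
    """Non-uniform quantization: fit 2**bits centroids with k-means
    weighted by per-weight sensitivity, so that sensitive weights pull
    the centroids toward themselves and are represented more accurately."""
    values = dense.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** bits, n_init=5, random_state=0)
    km.fit(values, sample_weight=sensitivity.reshape(-1))
    codes = km.predict(values).astype(np.uint8)   # 3-bit index per weight
    centroids = km.cluster_centers_.reshape(-1)   # 2**bits lookup values
    dequant = centroids[codes].reshape(dense.shape)
    return codes, centroids, dequant


# Toy usage on a random "layer"; real use would take one LLM weight matrix
# and a sensitivity estimate (e.g. a diagonal Hessian approximation) from a
# small calibration set instead of the random placeholder below.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)
sens = rng.uniform(0.1, 1.0, size=W.shape).astype(np.float32)  # placeholder

dense, sparse = dense_and_sparse_split(W)
codes, centroids, W_q = sensitivity_weighted_quantize(dense, sens, bits=3)
W_approx = W_q + sparse   # quantized dense part + full-precision sparse part
print("max |reconstruction error|:", np.abs(W - W_approx).max())
```

In a real deployment the positions zeroed out of the dense matrix would be excluded from clustering and the sparse component would be handled by a dedicated sparse matrix-vector kernel; the sketch keeps everything dense for brevity.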

Experimental Results

SqueezeLLM was applied to several LLMs, including the LLaMA models, achieving notable improvements in performance:

  • For 3-bit quantization of the LLaMA models, SqueezeLLM reduces the perplexity gap from the FP16 baseline by up to 2.1x relative to state-of-the-art methods under the same memory constraint.
  • When deployed on an A6000 GPU, the quantized models achieve up to a 2.3x inference speedup over the FP16 baseline, with only a minor accuracy trade-off (see the back-of-envelope estimate after this list).
  • When evaluated on language modeling benchmarks (C4, WikiText-2), MMLU, and instruction-following tasks, SqueezeLLM consistently outperforms existing post-training quantization methods such as GPTQ and AWQ.
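
As a rough sanity check on why lower weight precision translates into lower latency, the following back-of-envelope estimate (an assumption-laden sketch, not a measurement from the paper) treats single-batch decoding as bound purely by streaming the weights once per generated token on an A6000 with roughly 768 GB/s of memory bandwidth. Kernel efficiency, the sparse component, activations, and KV-cache traffic are ignored, which is why the measured speedup of up to 2.3x is smaller than the ideal ratio this estimate implies.

```python
# Back-of-envelope lower bound on per-token latency if decoding is limited
# purely by reading the weights from GPU memory (illustrative assumptions).
params = 6.7e9        # approx. LLaMA-7B parameter count
bandwidth = 768e9     # approx. RTX A6000 memory bandwidth, bytes/s

for label, bits in [("FP16 baseline", 16), ("3-bit dense weights", 3)]:
    weight_bytes = params * bits / 8
    ms_per_token = 1e3 * weight_bytes / bandwidth
    print(f"{label}: >= {ms_per_token:.1f} ms/token just to stream weights")
```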

Implications and Future Directions

The results indicate that SqueezeLLM offers a practical route to deploying LLMs in resource-constrained environments. By significantly reducing memory traffic and inference latency, this approach simplifies the deployment of memory-bound workloads and could alter infrastructure strategies by lessening dependence on expensive, high-memory GPUs.

On a practical level, the ability of this framework to maintain model accuracy while compressing model size opens avenues for deploying sophisticated NLP systems on more cost-effective hardware platforms. Theoretically, it enhances the understanding of quantization impacts on LLMs and provides a more nuanced view of balancing model precision with performance through innovative decomposition strategies.

Looking forward, it is worth exploring how the techniques in SqueezeLLM can be adapted to other architectures, such as encoder-only or encoder-decoder models common in a wide array of real-world NLP tasks. Moreover, investigating the integration of SqueezeLLM with dynamic optimization techniques or in conjunction with other compression strategies such as pruning could further enhance LLM efficiency, broadening the scope for scalable AI applications with limited computational resources.

Authors (8)
  1. Sehoon Kim
  2. Coleman Hooper
  3. Amir Gholami
  4. Zhen Dong
  5. Xiuyu Li
  6. Sheng Shen
  7. Michael W. Mahoney
  8. Kurt Keutzer