
SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models (2405.14917v1)

Published 23 May 2024 in cs.LG and cs.CL

Abstract: LLMs achieve remarkable performance in natural language understanding but require substantial computation and memory resources. Post-training quantization (PTQ) is a powerful compression technique that has been extensively investigated for LLMs. However, existing PTQ methods are still not ideal in terms of accuracy and efficiency, especially below 4 bit-widths. Standard PTQ methods using group-wise quantization struggle to quantize LLMs accurately at such low bit-widths, while advanced methods that retain high-precision weights element-wise find it hard to realize their theoretical hardware efficiency. This paper presents a Salience-Driven Mixed-Precision Quantization scheme for LLMs, namely SliM-LLM. The scheme exploits the salience distribution of weights to determine the optimal bit-widths and quantizers for accurate LLM quantization, while aligning the bit-width partition to groups for compact memory usage and fast integer inference. Specifically, SliM-LLM relies on two novel techniques: (1) Salience-Determined Bit Allocation uses the clustering characteristics of the salience distribution to allocate the bit-width of each group, improving the accuracy of quantized LLMs while maintaining inference efficiency; (2) Salience-Weighted Quantizer Calibration optimizes the parameters of the quantizer by considering the element-wise salience within each group, balancing the preservation of salient information against the minimization of error. Comprehensive experiments show that SliM-LLM significantly improves the accuracy of LLMs at ultra-low bits; e.g., 2-bit LLaMA-7B achieves a 5.5-times memory saving over the original model on NVIDIA A800 GPUs and a 48% decrease in perplexity compared to the state-of-the-art gradient-free PTQ method. Moreover, SliM-LLM+, which extends SliM-LLM with gradient-based quantizers, further reduces perplexity by 35.1%.

An Analysis of SliM-LLM: Salience-Driven Mixed-Precision Quantization for Efficient LLM Deployment

The paper introduces SliM-LLM, a novel approach to efficient quantization of LLMs through salience-driven mixed precision. As researchers continue to grapple with the computational demands and memory footprints of modern LLMs, effective post-training quantization (PTQ) methods remain essential for deploying these models in resource-constrained environments. This work addresses the shortcomings of existing PTQ methods, particularly at ultra-low bit-widths.

Core Contributions and Techniques

The authors propose SliM-LLM, a quantization framework that optimizes the precision of model weights by considering the salience distribution—an assessment of weight importance based on their influence on model output. This approach introduces two primary innovations:

  1. Salience-Determined Bit Allocation (SBA): SBA divides the weights of each layer into groups and allocates a bit-width to each group according to its salience ranking, keeping the mixed precision structured at the group level so that inference efficiency is preserved. The allocation is chosen to minimize the loss of output information, quantified via Kullback-Leibler divergence rather than the traditional mean squared error.
  2. Salience-Weighted Quantizer Calibration (SQC): SQC searches the quantizer's scale and zero-point while weighting the reconstruction error by the element-wise salience within each group, making the quantizer more responsive to locally salient weights and balancing the preservation of salient information against the minimization of overall error. A minimal sketch of both ideas follows this list.
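
To make the two mechanisms concrete, the following is a minimal NumPy sketch of group-wise, salience-driven quantization in the spirit of SBA and SQC. It is not the authors' implementation: the salience scores are assumed to be supplied by the caller (the paper derives them from second-order information), groups are fixed-size slices of a weight row, the {1, 2, 3}-bit palette around a 2-bit average budget is an illustrative choice, and all function names are hypothetical.

```python
# Sketch of salience-driven mixed-precision group quantization (illustrative only).
import numpy as np

def allocate_bits_by_salience(group_salience, avg_bits=2, low=1, high=3):
    """SBA-style allocation: the most salient groups are promoted to `high` bits,
    the rest stay at `low` bits, and the average bit-width stays at `avg_bits`."""
    n = len(group_salience)
    k = n * (avg_bits - low) // (high - low)   # number of groups promoted to `high`
    bits = np.full(n, low)
    if k > 0:
        bits[np.argsort(group_salience)[-k:]] = high
    return bits

def dequantize(w, bits, scale, zero):
    """Asymmetric round-to-nearest quantization followed by dequantization."""
    q = np.clip(np.round(w / scale) + zero, 0, 2 ** bits - 1)
    return scale * (q - zero)

def calibrate_quantizer(w, salience, bits, n_grid=80):
    """SQC-style calibration: grid-search the clipping range, scoring candidates by
    salience-weighted reconstruction error rather than plain mean squared error."""
    best, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, n_grid):
        lo, hi = frac * w.min(), frac * w.max()
        scale = (hi - lo) / (2 ** bits - 1) + 1e-12
        zero = np.round(-lo / scale)
        err = np.sum(salience * (dequantize(w, bits, scale, zero) - w) ** 2)
        if err < best_err:
            best, best_err = (scale, zero), err
    return best

# Toy usage: quantize one 64-element weight row in groups of 8.
rng = np.random.default_rng(0)
weights = rng.normal(size=64)
salience = np.abs(weights) * rng.uniform(0.5, 1.5, size=64)  # stand-in salience scores
groups = weights.reshape(-1, 8)
group_bits = allocate_bits_by_salience(salience.reshape(-1, 8).sum(axis=1))
for g, s, b in zip(groups, salience.reshape(-1, 8), group_bits):
    scale, zero = calibrate_quantizer(g, s, b)
    g[:] = dequantize(g, b, scale, zero)  # write back dequantized weights in place
```

In the actual method, the bit-width search is driven by minimizing the KL divergence of the layer output and the salience metric follows the Hessian-based formulation used in prior PTQ work; the sketch only preserves the overall structure of the two steps.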

Numerical Results

The empirical evaluations show that SliM-LLM significantly outperforms existing methods such as AWQ, GPTQ, and QuIP, especially in the 2-bit setting. For instance, the paper reports that 2-bit quantization of LLaMA-7B achieves a 5.5-fold reduction in memory usage on NVIDIA A800 GPUs and a 48% reduction in perplexity relative to the state-of-the-art gradient-free PTQ method. The enhanced version, SliM-LLM+, which incorporates gradient-based quantizers, further reduces perplexity by 35.1%. These results underscore the framework's ability to balance accuracy against hardware efficiency.
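
As a rough sanity check on the scale of that memory figure (a back-of-the-envelope estimate, not the paper's accounting): storing 2-bit weights with per-group FP16 scales and zero-points gives an idealized compression of roughly 7x over FP16 weight storage alone, and the measured end-to-end saving on the GPU is naturally somewhat lower because embeddings, other higher-precision tensors, and runtime buffers are not compressed at the same rate. The group size of 128 and the 6.7B parameter count below are assumptions.

```python
# Back-of-the-envelope weight-memory estimate (assumed numbers, not the paper's accounting).
params = 6.7e9                                   # approximate LLaMA-7B parameter count
fp16_gib = params * 16 / 8 / 2**30               # FP16 baseline weight storage
meta_bits = 32 / 128                             # FP16 scale + zero-point per group of 128
int2_gib = params * (2 + meta_bits) / 8 / 2**30  # 2-bit weights plus per-group metadata
print(f"FP16 ~{fp16_gib:.1f} GiB, 2-bit ~{int2_gib:.1f} GiB, "
      f"idealized ratio ~{fp16_gib / int2_gib:.1f}x")
```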

Implications and Speculations on Future AI Development

From a practical standpoint, SliM-LLM enables the deployment of LLMs in environments with limited computational resources, thus broadening the applicability of such models to edge computing and real-time applications. The structured approach to mixed-precision quantization reduces the hardware burden traditionally associated with quantized models, paving the way for scalable and efficient AI solutions.

Theoretically, this work suggests that further exploration of salience-based techniques could extend beyond quantization to enhance various model compression strategies such as pruning or distillation. The emphasis on salience—where not all model parameters contribute equally to output—may influence broader model optimization paradigms.

Conclusion

In summary, SliM-LLM offers a promising advancement in PTQ, aiming to optimize the computational efficiency of LLMs while minimizing losses in accuracy. By leveraging the concept of salience, this framework addresses both the precision requirements and computational challenges in deploying LLMs in practical applications. Future research may build upon these insights, leading to even more sophisticated methodologies for AI model compression and deployment.

References (52)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
  3. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  4. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  5. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  6. QuIP: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36, 2024.
  7. DB-LLM: Accurate dual-binarization for efficient llms. arXiv preprint arXiv:2402.11960, 2024.
  8. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  9. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  10. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  11. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
  12. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv preprint arXiv:2306.03078, 2023.
  13. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118, 2024.
  14. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475–4488, 2022.
  15. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pages 10323–10337. PMLR, 2023.
  16. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
  17. Compressing large-scale transformer-based models: A case study on bert. Transactions of the Association for Computational Linguistics, 9:1061–1080, 2021.
  18. APTQ: Attention-aware post-training mixed-precision quantization for large language models. arXiv preprint arXiv:2402.14866, 2024.
  19. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  20. BiLLM: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291, 2024.
  21. How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study. arXiv preprint arXiv:2404.14047, 2024.
  22. Matrix inversion using cholesky decomposition. In 2013 signal processing: Algorithms, architectures, arrangements, and applications (SPA), pages 70–72. IEEE, 2013.
  23. OWQ: Lessons learned from activation outliers for weight quantization in large language models. arXiv preprint arXiv:2306.02272, 2023.
  24. LLM-MQ: Mixed-precision quantization for efficient LLM deployment. In Advances in Neural Information Processing Systems (NeurIPS) ENLSP Workshop, 2024.
  25. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
  26. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  27. LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. arXiv preprint arXiv:2305.17888, 2023.
  28. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024.
  29. Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
  30. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  31. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pages 7197–7206. PMLR, 2020.
  32. Mitigating the impact of outlier channels for language model quantization with activation regularization. arXiv preprint arXiv:2404.03605, 2024.
  33. Accurate LoRA-Finetuning Quantization of LLMs via Information Retention. arXiv preprint arXiv:2402.05445, 2024.
  34. Bibench: Benchmarking and analyzing network binarization. arXiv preprint arXiv:2301.11233, 2023.
  35. Distribution-sensitive information retention for accurate binary neural network. International Journal of Computer Vision, 131(1):26–47, 2023.
  36. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  37. PB-LLM: Partially binarized large language models. arXiv preprint arXiv:2310.00034, 2023.
  38. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
  39. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
  40. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  41. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  42. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  43. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024.
  44. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.
  45. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.
  46. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717, 2023.
  47. Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.
  48. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  49. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  50. Meta-transformer: A unified framework for multimodal learning. arXiv preprint arXiv:2307.10802, 2023.
  51. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  52. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023.
Authors (8)
  1. Wei Huang (318 papers)
  2. Haotong Qin (60 papers)
  3. Yangdong Liu (4 papers)
  4. Yawei Li (72 papers)
  5. Xianglong Liu (128 papers)
  6. Luca Benini (362 papers)
  7. Michele Magno (118 papers)
  8. Xiaojuan Qi (133 papers)
Citations (10)