A Comprehensive Evaluation of Quantization Strategies for Large Language Models (2402.16775v2)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: Increasing the number of parameters in LLMs usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs.

A Comprehensive Evaluation of Quantization Strategies for LLMs

The paper "A Comprehensive Evaluation of Quantization Strategies for LLMs" presents a thorough examination of various quantization methods applied to LLMs. The primary motivation behind this investigation is the increasing computational and memory burden associated with deploying LLMs, especially in resource-constrained environments. Quantization is proposed as a plausible solution to mitigate these limitations by reducing the precision of model parameters, thereby lowering resource demands while maintaining a tolerable performance trade-off.

Key Contributions and Framework

The authors introduce a structured evaluation framework that assesses quantized LLMs across three critical dimensions:

  1. Knowledge & Capacity: This dimension is evaluated through benchmarks such as MMLU and C-EVAL, which measure the model's comprehension across various knowledge domains.
  2. Alignment: The adherence of models to human values and preferences is gauged using benchmarks like FollowBench, TruthfulQA, and BBQ.
  3. Efficiency: This is measured in terms of computational aspects such as memory consumption and inference speed.

The framework is applied across ten diverse benchmarks, highlighting the models' performance in knowledge understanding and alignment alongside their computational efficiency; an illustrative measurement sketch for the efficiency dimension is given below.
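
As a rough illustration of how the efficiency dimension can be instrumented, the sketch below loads a model in 4-bit precision via Hugging Face Transformers with bitsandbytes and reports decoding speed and peak GPU memory. The checkpoint identifier and NF4 settings are assumptions chosen for illustration; they are not necessarily the quantization backends or configurations used in the paper.

```python
# Hedged sketch: measure decoding speed (tokens/sec) and peak GPU memory for a
# 4-bit model. Requires a CUDA GPU plus the transformers and bitsandbytes packages.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"              # assumed example checkpoint
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

inputs = tok("Quantization reduces", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"decoding speed: {new_tokens / elapsed:.1f} tokens/sec")
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```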

Experimental Findings

The paper reveals several noteworthy outcomes:

  • 4-bit Quantization Retains Performance: Models quantized to 4 bits demonstrate performance comparable to their full-precision counterparts across most benchmarks. This suggests a viable path for deploying memory-efficient models without significantly sacrificing accuracy.
  • Perplexity as a Proxy: The perplexity of quantized models was found to correlate well with performance on various tasks, validating its utility as an indirect measure of model efficacy in the quantized setting (see the computation sketch after this list).
  • Outlier Weight Isolation: The paper highlights the significance of isolating outlier weights for extreme quantization levels (e.g., 2 bits). Methods like SpQR, which effectively manage such weights, perform better at lower precisions compared to alternatives like GPTQ.
  • Hardware Limitations and Quantization: The efficiency of quantized models, especially with respect to parallel computation, is hampered by current hardware limitations, underscoring the need for tailored hardware optimizations for low-precision arithmetic.
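
As a companion to the perplexity-as-proxy finding, here is a minimal sketch of computing perplexity over a held-out corpus using non-overlapping windows. The WikiText-2 dataset and checkpoint identifier are assumptions; the paper's exact evaluation corpus, context length, and windowing may differ.

```python
# Illustrative perplexity computation with non-overlapping context windows.
# Swap in a quantized checkpoint to compare it against a full-precision baseline.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"              # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

max_len = 1024
total_nll, n_tokens = 0.0, 0
for start in range(0, ids.size(1), max_len):
    chunk = ids[:, start : start + max_len].to(model.device)
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean next-token cross-entropy.
        loss = model(chunk, labels=chunk).loss
    total_nll += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(total_nll / n_tokens))
```

Comparing the perplexity of a quantized checkpoint against its downstream benchmark scores is the kind of correlation the paper uses to justify perplexity as a proxy metric.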

Implications and Future Directions

This paper underscores the practicality of deploying quantized LLMs under constrained resources, suggesting that 4-bit quantization offers a favorable balance between efficiency and performance. It also points out the potential for quantized models with larger parameter counts to outperform smaller non-quantized models at comparable resource usage. This observation could drive a shift toward optimizing larger models for edge deployments in the future.

Furthermore, the paper hints at unresolved challenges, particularly in efficiently scaling current quantization techniques with existing hardware, suggesting avenues for future research in hardware-aligned algorithmic development.

Conclusion

In conclusion, the paper presents a compelling case for the utility of quantization in efficiently deploying LLMs. It offers robust evidence supporting the viability of lower-bit quantization without substantial performance loss, backed by a comprehensive evaluation framework. The insights drawn from this paper not only inform current practices but also pave the way for continued innovation in AI model compression techniques, potentially influencing future developments in scalable AI technology.

References (73)
  1. A general language assistant as a laboratory for alignment. CoRR, abs/2112.00861.
  2. Qwen technical report. CoRR, abs/2309.16609.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. CoRR, abs/2204.05862.
  4. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. CoRR, abs/2302.04023.
  5. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642. The Association for Computational Linguistics.
  6. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  7. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. Just Accepted.
  8. Training verifiers to solve math word problems. CoRR, abs/2110.14168.
  9. No language left behind: Scaling human-centered machine translation. CoRR, abs/2207.04672.
  10. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  11. Qlora: Efficient finetuning of quantized llms. CoRR, abs/2305.14314.
  12. Spqr: A sparse-quantized representation for near-lossless LLM weight compression. CoRR, abs/2306.03078.
  13. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR.
  14. GPTQ: accurate post-training quantization for generative pre-trained transformers. CoRR, abs/2210.17323.
  15. OPTQ: accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  16. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 3356–3369. Association for Computational Linguistics.
  17. A survey of quantization methods for efficient neural network inference. CoRR, abs/2103.13630.
  18. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. Trans. Assoc. Comput. Linguistics, 10:522–538.
  19. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  20. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.
  21. Unlock predictable scaling from emergent abilities. CoRR, abs/2310.03262.
  22. Yufei Huang and Deyi Xiong. 2023. CBBQ: A chinese bias benchmark dataset curated with human-ai collaboration for large language models. CoRR, abs/2306.16244.
  23. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
  24. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 2704–2713. Computer Vision Foundation / IEEE Computer Society.
  25. Followbench: A multi-level fine-grained constraints following benchmark for large language models. CoRR, abs/2310.20410.
  26. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. CoRR, abs/2305.14152.
  27. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 13171–13189. Association for Computational Linguistics.
  28. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 431–469. Association for Computational Linguistics.
  29. OWQ: lessons learned from activation outliers for weight quantization in large language models. CoRR, abs/2306.02272.
  30. CMMLU: measuring massive multitask language understanding in chinese. CoRR, abs/2306.09212.
  31. Yucheng Li. 2023. Estimating contamination via perplexity: Quantifying memorisation in language model evaluation. CoRR, abs/2309.10677.
  32. Holistic evaluation of language models. CoRR, abs/2211.09110.
  33. AWQ: activation-aware weight quantization for LLM compression and acceleration. CoRR, abs/2306.00978.
  34. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 3214–3252. Association for Computational Linguistics.
  35. Do emergent abilities exist in quantized large language models: An empirical study. CoRR, abs/2307.08072.
  36. Alignbench: Benchmarking chinese alignment of large language models. CoRR, abs/2311.18743.
  37. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. CoRR, abs/2308.05374.
  38. LLM-QAT: data-free quantization aware training for large language models. CoRR, abs/2305.17888.
  39. Are emergent abilities in large language models just in-context learning? CoRR, abs/2309.01809.
  40. Gpteval: A survey on assessments of chatgpt and GPT-4. CoRR, abs/2308.12488.
  41. The penn treebank: Annotating predicate argument structure. In Human Language Technology, Proceedings of a Workshop held at Plainsboro, New Jerey, USA, March 8-11, 1994. Morgan Kaufmann.
  42. Pointer sentinel mixture models. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
  43. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 280–290. ACL.
  44. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1797–1807. Association for Computational Linguistics.
  45. Proving test set contamination in black box language models. CoRR, abs/2310.17623.
  46. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  47. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2086–2105. Association for Computational Linguistics.
  48. Instruction tuning with GPT-4. CoRR, abs/2304.03277.
  49. Toolllm: Facilitating large language models to master 16000+ real-world apis. CoRR, abs/2307.16789.
  50. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  51. Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing. CoRR, abs/2303.10845.
  52. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100.
  53. Are emergent abilities of large language models a mirage? In Thirty-seventh Conference on Neural Information Processing Systems.
  54. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1073–1083. Association for Computational Linguistics.
  55. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. CoRR, abs/1911.02150.
  56. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. CoRR, abs/2206.04615.
  57. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  58. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  59. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  60. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022.
  61. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 1648–1665. Association for Computational Linguistics.
  62. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 38–45. Association for Computational Linguistics.
  63. Training trajectories of language models across scales. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 13711–13738. Association for Computational Linguistics.
  64. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 38087–38099. PMLR.
  65. Rethinking benchmark and contamination for language models with rephrased samples. CoRR, abs/2311.04850.
  66. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  67. Do large language models know what they don’t know? In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pages 8653–8665. Association for Computational Linguistics.
  68. Hui Zeng. 2023. Measuring massive multitask chinese understanding. CoRR, abs/2304.12986.
  69. A survey of large language models. CoRR, abs/2303.18223.
  70. Instruction-following evaluation for large language models. CoRR, abs/2311.07911.
  71. A survey on model compression for large language models. CoRR, abs/2308.07633.
  72. Through the lens of core competency: Survey on evaluation of large language models. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum), pages 88–109, Harbin, China. Chinese Information Processing Society of China.
  73. Representation engineering: A top-down approach to AI transparency. CoRR, abs/2310.01405.
Authors (7)
  1. Renren Jin
  2. Jiangcun Du
  3. Wuwei Huang
  4. Wei Liu
  5. Jian Luan
  6. Bin Wang
  7. Deyi Xiong