When Quantization Affects Confidence of Large Language Models? (2405.00632v1)

Published 1 May 2024 in cs.CL and cs.AI

Abstract: Recent studies introduced effective compression techniques for LLMs via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibration of quantized models, considering factors such as LLM type and scale as contributors to quantization loss. Firstly, we reveal that quantization with GPTQ to 4-bit results in a decrease in confidence regarding true labels, with varying impacts observed among different LLMs. Secondly, we observe fluctuations in the impact on confidence across different scales. Finally, we propose an explanation for quantization loss based on confidence levels, indicating that quantization disproportionately affects samples where the full model exhibited low confidence levels in the first place.
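
The abstract turns on two measurements: the probability a model assigns to the true label (confidence) and how well those probabilities are calibrated. As a concrete illustration only, the sketch below computes both on synthetic multiple-choice logits; it is not the authors' evaluation code, and the 4-option setup, the 10-bin equal-width ECE estimator, and the Gaussian perturbation used as a stand-in for quantization error are all assumptions.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def true_label_confidence(logits: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Probability the model assigns to the gold answer of each sample."""
    probs = softmax(logits)
    return probs[np.arange(len(labels)), labels]

def expected_calibration_error(confidence: np.ndarray,
                               correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted mean |accuracy - confidence| gap per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Synthetic 4-way multiple-choice logits standing in for a full-precision model
# and its 4-bit counterpart (Gaussian noise is only a stand-in for quantization error).
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)
full_logits = rng.normal(size=(2000, 4)) + 2.0 * np.eye(4)[labels]
quant_logits = full_logits + rng.normal(scale=0.7, size=full_logits.shape)

for name, logits in [("full-precision", full_logits), ("quantized", quant_logits)]:
    probs = softmax(logits)
    conf_true = true_label_confidence(logits, labels)
    conf_pred = probs.max(axis=1)  # confidence in the predicted answer
    correct = (probs.argmax(axis=1) == labels).astype(float)
    ece = expected_calibration_error(conf_pred, correct)
    print(f"{name:15s}  mean true-label confidence={conf_true.mean():.3f}  ECE={ece:.3f}")
```

On real data, the logits would instead come from scoring a benchmark's answer options with the full-precision checkpoint and its 4-bit GPTQ-quantized counterpart, so the comparison reflects actual quantization error rather than injected noise.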

References (37)
  1. On the calibration of massively multilingual language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4310–4323, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  2. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  3. PiQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  5. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota. Association for Computational Linguistics.
  6. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR, abs/1803.05457.
  7. Shrey Desai and Greg Durrett. 2020. Calibration of pre-trained transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 295–302, Online. Association for Computational Linguistics.
  8. LLM.int8(): 8-bit matrix multiplication for transformers at scale. ArXiv, abs/2208.07339.
  9. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ArXiv, abs/2210.17323.
  10. OPTQ: Accurate quantization for generative pre-trained transformers. In International Conference on Learning Representations.
  11. A framework for few-shot language model evaluation.
  12. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
  13. Manish Gupta and Puneet Agrawal. 2020. Compression of deep learning models for text: A survey. ACM Trans. Knowl. Discov. Data, 16:61:1–61:55.
  14. Bridging fairness and environmental sustainability in natural language processing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7817–7836, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  15. Quantization and training of neural networks for efficient integer-arithmetic-only inference. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2704–2713.
  16. Mistral 7B. ArXiv, abs/2310.06825.
  17. How can we know when language models know? On the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977.
  18. How can we know when language models know? on the calibration of language models for question answering. Transactions of the Association for Computational Linguistics, 9:962–977.
  19. Scaling laws for neural language models. ArXiv, abs/2001.08361.
  20. BLOOM: A 176B-parameter open-access multilingual language model. ArXiv, abs/2211.05100.
  21. Do emergent abilities exist in quantized large language models: An empirical study. ArXiv, abs/2307.08072.
  22. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing.
  23. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, Valencia, Spain. Association for Computational Linguistics.
  24. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015, pages 2901–2907.
  25. Measuring calibration in deep learning. ArXiv, abs/1904.01685.
  26. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  27. A comparative study on the impact of model compression techniques on fairness in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15762–15782, Toronto, Canada. Association for Computational Linguistics.
  28. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  29. Compression of generative pre-trained language models via quantization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4821–4836, Dublin, Ireland. Association for Computational Linguistics.
  30. LLaMA: Open and efficient foundation language models. ArXiv, abs/2302.13971.
  31. Exploring predictive uncertainty and calibration in NLP: A study on the impact of method & data scarcity. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2707–2735, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  32. Emergent abilities of large language models. ArXiv, abs/2206.07682.
  33. Language models are few-shot multilingual learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 1–15, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  34. SmoothQuant: Accurate and efficient post-training quantization for large language models. ArXiv, abs/2211.10438.
  35. Uncertainty quantification with pre-trained language models: A large-scale empirical analysis. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7273–7284, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  36. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
  37. OPT: Open pre-trained transformer language models. ArXiv, abs/2205.01068.
Authors (4)
  1. Irina Proskurina (5 papers)
  2. Luc Brun (17 papers)
  3. Guillaume Metzler (4 papers)
  4. Julien Velcin (37 papers)
Citations (1)