
Extreme Compression of Large Language Models via Additive Quantization (2401.06118v4)

Published 11 Jan 2024 in cs.LG and cs.CL

Abstract: The emergence of accurate open LLMs has led to a race towards performant quantization techniques that can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression, defined as targeting extremely low bit counts such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval to advance the state-of-the-art in LLM compression, via two innovations: 1) learned additive quantization of weight matrices in input-adaptive fashion, and 2) joint optimization of codebook parameters across each transformer block. Broadly, AQLM is the first scheme that is Pareto optimal in terms of accuracy vs. model size when compressing to less than 3 bits per parameter, and it significantly improves upon all known schemes in the extreme compression (2-bit) regime. In addition, AQLM is practical: we provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed, while executing in a much smaller memory footprint.

Introduction

LLMs have advanced rapidly, attracting industrial and popular interest due to their accuracy and the prospect of running them locally on user devices. Compressing these models is vital for deployment on hardware with limited compute and memory. Quantization, the primary approach to post-training compression, reduces the bit-width of model parameters, thereby shrinking the memory footprint and improving computational efficiency. However, aggressive compression introduces a trade-off: extreme quantization typically causes accuracy loss. This paper presents a novel approach to LLM compression based on Additive Quantization (AQ), advancing the state-of-the-art in maintaining accuracy under tight compression budgets.
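To make the memory arithmetic concrete, here is a minimal round-to-nearest quantizer. This is the simplest possible baseline, not AQLM or any method from the paper; the function name, tensor shapes, and bit-width are assumptions chosen purely for illustration.

```python
import numpy as np

def quantize_rtn(w: np.ndarray, bits: int):
    """Round-to-nearest uniform quantization with a single scale and zero point.

    Illustrates the storage arithmetic only; post-training methods such as
    GPTQ or AQLM use far more careful objectives than plain rounding.
    """
    qmax = 2 ** bits - 1
    zero = w.min()
    scale = (w.max() - zero) / qmax
    codes = np.clip(np.round((w - zero) / scale), 0, qmax).astype(np.uint8)
    w_hat = codes * scale + zero          # de-quantized weights used at inference
    return codes, w_hat

bits = 2
w = np.random.randn(4096, 4096).astype(np.float32)   # one toy weight matrix
codes, w_hat = quantize_rtn(w, bits)
print(f"fp32 storage: {w.nbytes / 2**20:.0f} MiB, "
      f"{bits}-bit codes: {w.size * bits / 8 / 2**20:.0f} MiB")
```

At 2 bits per parameter the codes occupy one sixteenth of the fp32 storage, which is exactly why the extreme-compression regime is attractive for on-device inference and exactly where naive rounding loses the most accuracy.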

Methodology

The paper details a modified version of Additive Quantization (AQ), a classic algorithm from the multi-codebook quantization (MCQ) family, adapted to compress LLM weights while preserving model quality. The new approach, named Additive Quantization for LLMs (AQLM), reformulates the standard AQ objective to minimize the error in the layer outputs on calibration data rather than the error in the weights themselves. By making the algorithm input-adaptive and calibrating layer by layer, AQLM achieves a homogeneous and simple quantization format that maintains high accuracy even at extreme compression levels such as 2 bits per parameter.
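To make the additive, multi-codebook representation concrete, the following NumPy sketch shows the general idea rather than the paper's algorithm: every group of g weights is stored as a sum of M codewords, one index per codebook, so M=2 codebooks of 256 entries over groups of 8 weights cost roughly 2 bits per weight (ignoring codebook storage). Codes are assigned greedily on the residual here, whereas AQLM uses beam search, optimizes the output error ||XW - XW_hat||_F directly, and jointly fine-tunes codebooks across a transformer block; all names, shapes, and hyperparameters below are illustrative.

```python
import numpy as np

def aq_sketch(W, X, M=2, K=256, g=8, iters=5, seed=0):
    """Toy additive (multi-codebook) quantization of a weight matrix W (d_in x d_out).

    Simplified sketch of the idea behind AQLM, not the paper's procedure:
    codes are picked greedily on the weight residual, codebooks are refreshed
    with a k-means-style update, and the input-aware objective on calibration
    inputs X is only reported (AQLM optimizes it directly).
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = W.shape                       # assumes d_in % g == 0
    # split each column of W into groups of g consecutive weights
    groups = W.reshape(d_in // g, g, d_out).transpose(0, 2, 1).reshape(-1, g)
    codebooks = [rng.standard_normal((K, g)) * groups.std() for _ in range(M)]
    codes = np.zeros((M, groups.shape[0]), dtype=np.int64)

    for _ in range(iters):
        # (1) greedy code assignment on the running residual, one codebook at a time
        resid = groups.copy()
        for m, C in enumerate(codebooks):
            d2 = ((resid[:, None, :] - C[None, :, :]) ** 2).sum(-1)   # (n_groups, K)
            codes[m] = d2.argmin(1)
            resid -= C[codes[m]]
        # (2) codebook update: each codeword moves to the mean of the groups it serves,
        #     with the other codebooks' contributions subtracted out
        for m, C in enumerate(codebooks):
            target = groups - sum(codebooks[j][codes[j]] for j in range(M) if j != m)
            for k in range(K):
                mask = codes[m] == k
                if mask.any():
                    C[k] = target[mask].mean(0)

    # reassemble the de-quantized matrix and report the input-aware error
    W_hat = sum(C[codes[m]] for m, C in enumerate(codebooks))
    W_hat = W_hat.reshape(d_in // g, d_out, g).transpose(0, 2, 1).reshape(d_in, d_out)
    rel_err = np.linalg.norm(X @ W - X @ W_hat) / np.linalg.norm(X @ W)
    return W_hat, rel_err

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128))           # toy "layer" weight matrix
X = rng.standard_normal((256, 128))           # toy calibration inputs
W_hat, rel_err = aq_sketch(W, X)              # M=2, K=256, g=8  ->  about 2 bits/weight
print(f"relative output error after quantization: {rel_err:.3f}")
```

The key design point the sketch tries to convey is that, unlike scalar rounding, the representation is a learned sum of codewords, which gives the quantizer far more expressive power per stored bit; AQLM then points that expressive power at the layer-output error rather than the raw weight error.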

Results

AQLM shows superior performance when quantizing LLMs of various sizes, with significant improvements over existing methods across several bit-width budgets. The paper reports extensive evaluations on popular models such as Llama 2, measuring both perplexity and zero-shot task accuracy. Notably, substantial perplexity improvements are recorded at the extreme low end of 2-bit compression.
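The exact evaluation protocol is not spelled out in this summary; the snippet below is a common recipe for the kind of language-modeling perplexity numbers such comparisons report, using Hugging Face transformers and datasets. The checkpoint name, the choice of WikiText-2, the sequence length, and the non-overlapping windowing are assumptions and may differ from the paper's setup.

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Substitute the checkpoint you want to evaluate (quantized or FP16); this name is an example.
MODEL = "meta-llama/Llama-2-7b-hf"

@torch.no_grad()
def wikitext2_perplexity(model, tokenizer, seq_len=2048, device="cuda"):
    """Perplexity over non-overlapping windows of the WikiText-2 test split."""
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nlls = []
    for start in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, start:start + seq_len]
        loss = model(chunk, labels=chunk).loss      # mean token NLL for this window
        nlls.append(loss.float() * seq_len)
    return math.exp(torch.stack(nlls).sum().item() / (len(nlls) * seq_len))

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda").eval()
print(f"WikiText-2 perplexity: {wikitext2_perplexity(model, tokenizer):.2f}")
```

Running the same harness on an FP16 baseline and on its 2-bit or 3-bit compressed counterpart is what makes the accuracy-vs-model-size comparisons in the paper meaningful.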

Conclusion

AQLM is a significant contribution to LLM quantization, showing that high accuracy can be maintained even at very low bit counts. It is a critical step toward making large models usable in a broader range of environments, especially resource-constrained ones. The released implementation supports ongoing research and development, providing a foundation for future work on efficient LLM deployment on consumer-grade devices. Further work aims to streamline AQLM's computational cost and to explore optimal parameter settings for model compression.

Authors (6)
  1. Vage Egiazarian (16 papers)
  2. Andrei Panferov (7 papers)
  3. Denis Kuznedelev (21 papers)
  4. Elias Frantar (24 papers)
  5. Artem Babenko (43 papers)
  6. Dan Alistarh (133 papers)
Citations (57)