GPTVQ: The Blessing of Dimensionality for LLM Quantization (2402.15319v1)

Published 23 Feb 2024 in cs.LG and cs.CL

Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to LLMs. Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state-of-the-art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llamav2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.

Authors (8)
  1. Mart van Baalen (18 papers)
  2. Andrey Kuzmin (8 papers)
  3. Markus Nagel (33 papers)
  4. Peter Couperus (1 paper)
  5. Cedric Bastoul (7 papers)
  6. Eric Mahurin (2 papers)
  7. Tijmen Blankevoort (37 papers)
  8. Paul Whatmough (22 papers)
Citations (13)

Summary

  • The paper introduces GPTVQ, a novel method that leverages multi-dimensional vector quantization to achieve superior LLM compression with minimal accuracy loss.
  • It employs Hessian-guided optimization and an efficient EM algorithm for precise codebook initialization and rapid post-training compression.
  • Empirical results on models like Llama-v2-70B show nearly 2 perplexity point improvements, underscoring its effectiveness for resource-constrained deployments.

Vector Quantization Unlocks Improved Efficiency in LLMs

Overview

The continuous evolution of LLMs presents a paradox of progress: while these models achieve remarkable linguistic feats, their growing parameter counts hinder widespread deployment, particularly on resource-constrained platforms. Tackling this challenge, the paper introduces GPTVQ, a method that leverages vector quantization (VQ) to achieve state-of-the-art compression without significant loss in accuracy. The technique demonstrates that increasing quantization dimensionality can significantly improve the size versus accuracy trade-off, establishing new benchmarks across a spectrum of LLMs.

The Core Technique

The GPTVQ approach extends existing methodologies by applying vector quantization in one or more dimensions, yielding a post-training compression algorithm that is both fast and accurate. At its heart, GPTVQ interleaves quantization of one or more columns with updates to the remaining unquantized weights, using Hessian information from the per-layer output reconstruction mean squared error (MSE). Codebooks are initialized with an efficient, data-aware version of the EM algorithm, then updated and compressed further through integer quantization and SVD-based compression. Notably, GPTVQ not only offers a superior size versus accuracy trade-off but also improves efficiency, reducing inference latency on a mobile CPU compared to a standard 4-bit integer format.
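To make the interleaved procedure concrete, the sketch below illustrates the general idea in NumPy; it is not the authors' code. Blocks of d weights are matched to a per-group codebook initialized with plain k-means (standing in for the paper's data-aware EM initialization), and each processed column's quantization error is pushed onto the not-yet-quantized columns through the inverse Hessian, in the spirit of GPTQ-style compensation. The H_inv argument, the group size, and the omission of within-group compensation and of the codebook's own integer/SVD compression are simplifying assumptions.

```python
# Minimal sketch of the interleaved VQ + error-feedback idea (not the
# authors' implementation; shapes, H_inv handling, and the plain k-means
# initialization are simplifying assumptions).
import numpy as np


def kmeans_codebook(vectors, k, iters=10, seed=0):
    """Plain k-means over d-dimensional weight blocks, standing in for the
    paper's data-aware EM codebook initialization."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # E-step: assign each block to its nearest centroid.
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # M-step: move each centroid to the mean of its assigned blocks.
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids


def gptvq_sketch(W, H_inv, d=2, k=256):
    """Quantize W in groups of d columns with a per-group VQ codebook,
    pushing each column's quantization error onto the not-yet-quantized
    columns via the inverse Hessian (simplified GPTQ-style compensation)."""
    W = W.copy()
    rows, cols = W.shape
    for g in range(0, cols, d):
        blocks = W[:, g:g + d]                      # (rows, d) VQ vectors
        codebook = kmeans_codebook(blocks, min(k, rows))
        dists = ((blocks[:, None, :] - codebook[None]) ** 2).sum(-1)
        q = codebook[dists.argmin(1)]               # nearest codebook entry
        err = blocks - q                            # quantization error
        W[:, g:g + d] = q
        if g + d >= cols:
            continue
        # Error feedback: shift the remaining columns so the layer-output
        # reconstruction error stays small.
        for j in range(q.shape[1]):
            col = g + j
            W[:, g + d:] -= np.outer(err[:, j] / H_inv[col, col],
                                     H_inv[col, g + d:])
    return W
```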

Empirical Validation

The empirical results underscore GPTVQ's efficacy across various LLMs, including Llama-v2 and Mistral, with marked improvements in the model size versus accuracy trade-off. For instance, on Llama-v2-70B, GPTVQ achieves an improvement of nearly 2 perplexity points over conventional uniform quantization methods, highlighting its potential to redefine compression standards for LLMs. The method is also practical: processing a Llama-v2-70B model takes between 3 and 11 hours on a single H100, depending on the quantization setting.
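Part of the appeal of higher-dimensional VQ is simple bookkeeping: each d-dimensional index amortizes its bits over d weights, while the shared codebook adds only a small per-weight overhead. The back-of-the-envelope calculation below uses illustrative numbers; the group size and codebook bit-width are assumptions, not the paper's reported settings.

```python
import math

# Illustrative VQ bit budget (assumed numbers, not the paper's settings).
d = 2           # VQ dimensionality: one index covers d weights
k = 256         # codebook entries -> log2(k) bits per index
group = 65536   # weights sharing one codebook (assumed group size)
cb_bits = 8     # integer bit-width for stored codebook entries (assumed)

index_bpw = math.log2(k) / d              # 4.0 bits per weight from indices
codebook_bpw = k * d * cb_bits / group    # 0.0625 bits per weight of overhead
print(index_bpw + codebook_bpw)           # ~4.06 effective bits per weight
```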

Future Directions and Impact

GPTVQ marks a pivotal step towards making LLMs more accessible and deployable across varying computational environments. Looking ahead, further refining the balance between compression, accuracy, and efficiency remains an ongoing pursuit. The scalability of vector-quantized networks opens new avenues for research, especially in developing hardware and software ecosystems that can fully exploit these formats. As network efficiency continues to improve, understanding the broader implications, including potential biases introduced through quantization, will be critical to ensuring that these advances contribute positively to the democratization of AI.

Conclusion

In summary, GPTVQ advances the discourse on LLM efficiency by presenting a method that considerably reduces model size without significant performance degradation while also suggesting pathways to improved operational efficiency. As the AI community moves further into an era of very large neural networks, methods like GPTVQ help reconcile the pace of innovation with the realities of deployment constraints.