GPTVQ: The Blessing of Dimensionality for LLM Quantization (2402.15319v1)

Published 23 Feb 2024 in cs.LG and cs.CL

Abstract: In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to LLMs. Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression. GPTVQ establishes a new state-of-the-art in the size vs accuracy trade-offs on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llamav2-70B model, depending on quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.

Authors (8)
  1. Mart van Baalen (18 papers)
  2. Andrey Kuzmin (8 papers)
  3. Markus Nagel (33 papers)
  4. Peter Couperus (1 paper)
  5. Cedric Bastoul (7 papers)
  6. Eric Mahurin (2 papers)
  7. Tijmen Blankevoort (37 papers)
  8. Paul Whatmough (22 papers)
Citations (13)

Summary

  • The paper introduces GPTVQ, a novel method that leverages multi-dimensional vector quantization to achieve superior LLM compression with minimal accuracy loss.
  • It employs Hessian-guided optimization and an efficient EM algorithm for precise codebook initialization and rapid post-training compression.
  • Empirical results on models like Llama-v2-70B show nearly 2 perplexity point improvements, underscoring its effectiveness for resource-constrained deployments.

Vector Quantization Unlocks Improved Efficiency in LLMs

Overview

The continuous evolution of LLMs presents a paradox of progress: while these models achieve remarkable linguistic feats, their growing parameter counts hinder widespread deployment, particularly on resource-constrained platforms. Tackling this challenge, the paper introduces GPTVQ, a method that leverages vector quantization (VQ) to achieve state-of-the-art compression without significant loss in accuracy. The technique demonstrates that increasing quantization dimensionality can significantly improve the size versus accuracy trade-off, establishing new benchmarks across a spectrum of LLMs.

The Core Technique

The GPTVQ approach extends existing methodologies by applying vector quantization in one or more dimensions, yielding a post-training compression algorithm that is both fast and accurate. At its heart, GPTVQ interleaves quantization of one or more columns with updates to the remaining unquantized weights, using Hessian information from the per-layer output reconstruction mean squared error (MSE). Codebooks are initialized with an efficient, data-aware version of the EM algorithm, then updated and compressed further through integer quantization and SVD-based compression. Notably, GPTVQ not only offers a superior size versus accuracy trade-off but also improves efficiency, reducing inference latency on a mobile CPU compared to a standard 4-bit integer format.
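To make the interleaved procedure concrete, the sketch below illustrates the general idea in NumPy; it is not the authors' code. Blocks of d weights are matched to a per-group codebook initialized with plain k-means (standing in for the paper's data-aware EM initialization), and each processed column's quantization error is pushed onto the not-yet-quantized columns through the inverse Hessian, in the spirit of GPTQ-style compensation. The H_inv argument, the group size, and the omission of within-group compensation and of the codebook's own integer/SVD compression are simplifying assumptions.

```python
# Minimal sketch of the interleaved VQ + error-feedback idea (not the
# authors' implementation; shapes, H_inv handling, and the plain k-means
# initialization are simplifying assumptions).
import numpy as np


def kmeans_codebook(vectors, k, iters=10, seed=0):
    """Plain k-means over d-dimensional weight blocks, standing in for the
    paper's data-aware EM codebook initialization."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # E-step: assign each block to its nearest centroid.
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # M-step: move each centroid to the mean of its assigned blocks.
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    return centroids


def gptvq_sketch(W, H_inv, d=2, k=256):
    """Quantize W in groups of d columns with a per-group VQ codebook,
    pushing each column's quantization error onto the not-yet-quantized
    columns via the inverse Hessian (simplified GPTQ-style compensation)."""
    W = W.copy()
    rows, cols = W.shape
    for g in range(0, cols, d):
        blocks = W[:, g:g + d]                      # (rows, d) VQ vectors
        codebook = kmeans_codebook(blocks, min(k, rows))
        dists = ((blocks[:, None, :] - codebook[None]) ** 2).sum(-1)
        q = codebook[dists.argmin(1)]               # nearest codebook entry
        err = blocks - q                            # quantization error
        W[:, g:g + d] = q
        if g + d >= cols:
            continue
        # Error feedback: shift the remaining columns so the layer-output
        # reconstruction error stays small.
        for j in range(q.shape[1]):
            col = g + j
            W[:, g + d:] -= np.outer(err[:, j] / H_inv[col, col],
                                     H_inv[col, g + d:])
    return W
```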

Empirical Validation

The empirical results underscore GPTVQ's efficacy across various LLMs, including Llama-v2 and Mistral, with marked improvements in the model size versus accuracy trade-off. For instance, on Llama-v2-70B, GPTVQ achieves an improvement of nearly 2 perplexity points over conventional uniform quantization methods, highlighting its potential to redefine compression standards for LLMs. The method is also practical: processing a Llama-v2-70B model takes between 3 and 11 hours on a single H100, depending on the quantization setting.
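Part of the appeal of higher-dimensional VQ is simple bookkeeping: each d-dimensional index amortizes its bits over d weights, while the shared codebook adds only a small per-weight overhead. The back-of-the-envelope calculation below uses illustrative numbers; the group size and codebook bit-width are assumptions, not the paper's reported settings.

```python
import math

# Illustrative VQ bit budget (assumed numbers, not the paper's settings).
d = 2           # VQ dimensionality: one index covers d weights
k = 256         # codebook entries -> log2(k) bits per index
group = 65536   # weights sharing one codebook (assumed group size)
cb_bits = 8     # integer bit-width for stored codebook entries (assumed)

index_bpw = math.log2(k) / d              # 4.0 bits per weight from indices
codebook_bpw = k * d * cb_bits / group    # 0.0625 bits per weight of overhead
print(index_bpw + codebook_bpw)           # ~4.06 effective bits per weight
```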

Future Directions and Impact

GPTVQ marks a pivotal step towards making LLMs more accessible and deployable across varying computational environments. Looking ahead, further refining the balance between compression, accuracy, and efficiency remains an ongoing pursuit. The scalability of vector-quantized networks opens new avenues for research, especially in developing hardware and software ecosystems that can fully exploit these formats. As network efficiency continues to improve, understanding the broader implications, including potential biases introduced through quantization, will be critical to ensuring that these advances contribute positively to the democratization of AI.

Conclusion

In summary, GPTVQ advances the discourse on LLM efficiency by presenting a method that considerably reduces model size without significant performance degradation while also suggesting pathways to improved operational efficiency. As the AI community moves further into an era of very large neural networks, methods like GPTVQ help reconcile the pace of innovation with the realities of deployment constraints.