GPTQ: Efficient Post-Training Quantization of Large-Scale Generative Pre-trained Transformers
Introduction
The proliferation of Generative Pre-trained Transformer (GPT) models has significantly advanced the state of the art in natural language processing, enabling unprecedented performance across a wide range of tasks. However, deploying these models, particularly the largest variants such as GPT-3 with 175 billion parameters, is severely hampered by their computational and memory requirements. Addressing this challenge, this paper introduces GPTQ, a post-training quantization technique tailored to the efficient compression of large-scale GPT models. GPTQ compresses models with hundreds of billions of parameters to 3-4 bits per weight without substantial loss in accuracy.
Methodology
GPTQ relies on a one-shot weight quantization method that uses approximate second-order information. The technique compresses weights to 3 or 4 bits each, down from the 16-bit floating-point representation commonly used, substantially reducing model size and memory traffic while preserving the model's performance. Remarkably, GPTQ can process models as large as 175 billion parameters in approximately four GPU hours, markedly faster than previous accurate quantization methods at this scale.
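To make the procedure concrete, the sketch below illustrates the core idea of second-order, column-by-column quantization: each weight column is rounded to a low-bit grid, and the resulting error is redistributed to the remaining, not-yet-quantized columns using the inverse Hessian of the layer's inputs. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses NumPy, a plain matrix inverse instead of the Cholesky-based formulation, and a simple symmetric per-column grid; the function and argument names are placeholders.

```python
import numpy as np

def gptq_quantize_layer(W, H, bits=4, damp=0.01):
    """Column-wise quantization with second-order error compensation.

    W: (out_features, in_features) weight matrix of one linear layer.
    H: (in_features, in_features) Hessian proxy 2 * X @ X.T built from
       calibration inputs. Simplified sketch; the paper's version uses a
       Cholesky factorization and lazy batched updates for efficiency.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed

    # Dampen the diagonal for numerical stability, then invert the Hessian.
    H = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for j in range(cols):
        col = W[:, j]
        scale = np.max(np.abs(col)) / qmax + 1e-12  # simple symmetric grid
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q

        # Spread the quantization error over the remaining columns so that
        # later columns can compensate for it (the second-order update).
        err = (col - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

Because the columns are processed in a fixed order rather than the greedy per-row order of earlier approaches, the same update can be shared across all rows of the weight matrix, which is what lets this style of quantization scale to very large layers.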
Key to GPTQ's success is its ability to maintain accuracy even at these low bitwidths. The method relies on a calibration stage that uses only a small set of input samples to estimate per-layer second-order statistics, ensuring that the quantized models still deliver high-quality generative capabilities. Furthermore, the quantization process requires no retraining or fine-tuning, making it a practical route to deploying large models efficiently.
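A minimal sketch of that calibration step, assuming the layer inputs have already been captured, is shown below: the inputs observed on a handful of calibration sequences are accumulated into the Hessian proxy H = 2 XXᵀ that the quantization routine above consumes. The function name and batch layout are illustrative assumptions; in practice this is typically done with forward hooks on the model's linear layers.

```python
import numpy as np

def accumulate_hessian(calibration_batches, in_features):
    """Estimate the layer Hessian proxy H = 2 * sum_t x_t x_t^T from a small
    calibration set. `calibration_batches` yields arrays of shape
    (num_tokens, in_features) holding the layer's inputs (illustrative)."""
    H = np.zeros((in_features, in_features))
    num_tokens = 0
    for X in calibration_batches:
        X = X.astype(np.float64)
        H += 2.0 * X.T @ X
        num_tokens += X.shape[0]
    return H / max(num_tokens, 1)   # averaging keeps the scale comparable
```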
Experimental Results
GPTQ's efficacy is demonstrated through extensive experiments on language generation benchmarks. The results show that GPTQ can quantize models to 3-4 bits per weight with only a minimal impact on perplexity and output fidelity. For instance, the quantized OPT-175B model attains perplexity scores close to those of the uncompressed FP16 baseline, even under extreme compression to 3 bits per weight.
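For reference, the comparison metric is standard token-level perplexity, which can be computed as in the sketch below for any Hugging Face style causal language model exposing `.logits`; the helper and its arguments are illustrative assumptions, not the paper's evaluation harness.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    """Token-level perplexity of a causal LM over a long token stream,
    evaluated in non-overlapping windows (illustrative harness)."""
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for start in range(0, token_ids.numel() - seq_len, seq_len):
        chunk = token_ids[start:start + seq_len].unsqueeze(0).to(device)
        with torch.no_grad():
            logits = model(chunk).logits              # (1, seq_len, vocab)
        # Each position predicts the next token, so shift targets by one.
        nll = F.cross_entropy(logits[0, :-1], chunk[0, 1:], reduction="sum")
        total_nll += nll.item()
        total_tokens += seq_len - 1
    return math.exp(total_nll / total_tokens)
```

Running the same harness on the FP16 baseline and on the quantized model gives the directly comparable scores reported in the paper.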
Moreover, GPTQ not only preserves the quality of generated text but also makes deployment markedly cheaper. The paper highlights the practical implications of this compression, notably the ability to run a 175-billion-parameter model on a single GPU, a capability that dramatically lowers the barrier to leveraging advanced generative models for a wider range of applications and users. Paired with custom dequantization kernels, the method also speeds up generative inference, by roughly 3.25x on high-end GPUs (NVIDIA A100) and up to 4.5x on more cost-effective ones (NVIDIA A6000).
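The memory savings that make single-GPU execution possible come from storing two 4-bit weights per byte and dequantizing them on the fly inside the matrix-vector product; the paper fuses this into custom GPU kernels, whereas the NumPy sketch below only illustrates the packing scheme and the dequantize-then-multiply pattern. The layout and helper names are assumptions made for illustration.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit integers two per byte (~4x smaller than FP16).
    Expects an even number of values along the last axis."""
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(packed, cols):
    """Recover signed 4-bit values; real kernels do this inside the matmul."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    out = np.empty(packed.shape[:-1] + (cols,), dtype=np.int8)
    out[..., 0::2], out[..., 1::2] = lo, hi
    return out

def quantized_matvec(packed, scales, x):
    """Dequantize-then-multiply: the pattern fused GPU kernels accelerate."""
    W = unpack_int4(packed, 2 * packed.shape[-1]).astype(np.float32)
    return (W * scales[:, None]) @ x        # per-row scales, illustrative
```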
Discussion and Implications
The introduction of GPTQ represents a significant step forward in the post-training quantization of large-scale transformer models. By enabling efficient compression without notable loss in accuracy, GPTQ facilitates broader access to state-of-the-art generative models, paving the way for their deployment in diverse real-world scenarios where computational resources are limited.
Looking forward, GPTQ opens avenues for future research in the compression of generative models, including potential explorations into activation quantization and hardware-accelerated execution of quantized models. Moreover, the successful application of GPTQ to the largest GPT models suggests that similar methodologies could be extended to other transformer architectures, potentially unlocking new efficiencies across a broad array of AI domains.
In summary, GPTQ presents a compelling solution to the challenges of deploying large-scale generative models, offering a balanced trade-off between model size, performance, and computational efficiency. Its capability to execute highly compressed models on constrained hardware without significant performance degradation holds promise for democratizing access to advanced AI technologies.