GPTQ: Efficient Post-Training Quantization of Large-Scale Generative Pre-trained Transformers
Introduction
The proliferation of Generative Pre-trained Transformer (GPT) models has significantly advanced the state of the art in natural language processing, enabling unprecedented performance across a wide range of tasks. However, deploying these models, particularly the largest variants such as GPT-3 with 175 billion parameters, is severely hampered by their computational and memory requirements. Addressing this challenge, this paper introduces GPTQ, a post-training quantization technique tailored to the efficient compression of large-scale GPT models. GPTQ compresses models with hundreds of billions of parameters to 3-4 bits per weight without substantial loss in accuracy.
Methodology
GPTQ relies on a one-shot weight quantization method that uses approximate second-order information. The technique compresses weights to 3 or 4 bits each, down from the 16-bit floating-point representation commonly used, substantially reducing model size and memory traffic while preserving the model's performance. Remarkably, GPTQ can process models as large as 175 billion parameters in approximately four GPU hours, markedly faster than previous accurate quantization methods at this scale.
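To make the procedure concrete, the sketch below illustrates the core idea of second-order, column-by-column quantization: each weight column is rounded to a low-bit grid, and the resulting error is redistributed to the remaining, not-yet-quantized columns using the inverse Hessian of the layer's inputs. This is a simplified illustration under stated assumptions, not the paper's implementation: it uses NumPy, a plain matrix inverse instead of the Cholesky-based formulation, and a simple symmetric per-column grid; the function and argument names are placeholders.

```python
import numpy as np

def gptq_quantize_layer(W, H, bits=4, damp=0.01):
    """Column-wise quantization with second-order error compensation.

    W: (out_features, in_features) weight matrix of one linear layer.
    H: (in_features, in_features) Hessian proxy 2 * X @ X.T built from
       calibration inputs. Simplified sketch; the paper's version uses a
       Cholesky factorization and lazy batched updates for efficiency.
    """
    W = W.astype(np.float64).copy()
    cols = W.shape[1]
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed

    # Dampen the diagonal for numerical stability, then invert the Hessian.
    H = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for j in range(cols):
        col = W[:, j]
        scale = np.max(np.abs(col)) / qmax + 1e-12  # simple symmetric grid
        q = np.clip(np.round(col / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q

        # Spread the quantization error over the remaining columns so that
        # later columns can compensate for it (the second-order update).
        err = (col - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q
```

Because the columns are processed in a fixed order rather than the greedy per-row order of earlier approaches, the same update can be shared across all rows of the weight matrix, which is what lets this style of quantization scale to very large layers.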
Key to GPTQ's success is its ability to maintain accuracy even at these low bitwidths. The method relies on a calibration stage that uses only a small set of input samples to estimate per-layer second-order statistics, ensuring that the quantized models still deliver high-quality generative capabilities. Furthermore, the quantization process requires no retraining or fine-tuning, making it a practical route to deploying large models efficiently.
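A minimal sketch of that calibration step, assuming the layer inputs have already been captured, is shown below: the inputs observed on a handful of calibration sequences are accumulated into the Hessian proxy H = 2 XXᵀ that the quantization routine above consumes. The function name and batch layout are illustrative assumptions; in practice this is typically done with forward hooks on the model's linear layers.

```python
import numpy as np

def accumulate_hessian(calibration_batches, in_features):
    """Estimate the layer Hessian proxy H = 2 * sum_t x_t x_t^T from a small
    calibration set. `calibration_batches` yields arrays of shape
    (num_tokens, in_features) holding the layer's inputs (illustrative)."""
    H = np.zeros((in_features, in_features))
    num_tokens = 0
    for X in calibration_batches:
        X = X.astype(np.float64)
        H += 2.0 * X.T @ X
        num_tokens += X.shape[0]
    return H / max(num_tokens, 1)   # averaging keeps the scale comparable
```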
Experimental Results
GPTQ's efficacy is demonstrated through extensive experiments on language generation benchmarks. The results show that GPTQ can quantize models to 3-4 bits per weight with only a minimal impact on perplexity and output fidelity. For instance, the quantized OPT-175B model attains perplexity scores close to those of the uncompressed FP16 baseline, even under extreme compression to 3 bits per weight.
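For reference, the comparison metric is standard token-level perplexity, which can be computed as in the sketch below for any Hugging Face style causal language model exposing `.logits`; the helper and its arguments are illustrative assumptions, not the paper's evaluation harness.

```python
import math
import torch
import torch.nn.functional as F

def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    """Token-level perplexity of a causal LM over a long token stream,
    evaluated in non-overlapping windows (illustrative harness)."""
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for start in range(0, token_ids.numel() - seq_len, seq_len):
        chunk = token_ids[start:start + seq_len].unsqueeze(0).to(device)
        with torch.no_grad():
            logits = model(chunk).logits              # (1, seq_len, vocab)
        # Each position predicts the next token, so shift targets by one.
        nll = F.cross_entropy(logits[0, :-1], chunk[0, 1:], reduction="sum")
        total_nll += nll.item()
        total_tokens += seq_len - 1
    return math.exp(total_nll / total_tokens)
```

Running the same harness on the FP16 baseline and on the quantized model gives the directly comparable scores reported in the paper.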
Moreover, GPTQ not only preserves the quality of generated text but also makes deployment markedly cheaper. The paper highlights the practical implications of this compression, notably the ability to run a 175-billion-parameter model on a single GPU, a capability that dramatically lowers the barrier to leveraging advanced generative models for a wider range of applications and users. Paired with custom dequantization kernels, the method also speeds up generative inference, by roughly 3.25x on high-end GPUs (NVIDIA A100) and up to 4.5x on more cost-effective ones (NVIDIA A6000).
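The memory savings that make single-GPU execution possible come from storing two 4-bit weights per byte and dequantizing them on the fly inside the matrix-vector product; the paper fuses this into custom GPU kernels, whereas the NumPy sketch below only illustrates the packing scheme and the dequantize-then-multiply pattern. The layout and helper names are assumptions made for illustration.

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit integers two per byte (~4x smaller than FP16).
    Expects an even number of values along the last axis."""
    u = (q.astype(np.int8) & 0x0F).astype(np.uint8)
    return u[..., 0::2] | (u[..., 1::2] << 4)

def unpack_int4(packed, cols):
    """Recover signed 4-bit values; real kernels do this inside the matmul."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo[lo > 7] -= 16
    hi[hi > 7] -= 16
    out = np.empty(packed.shape[:-1] + (cols,), dtype=np.int8)
    out[..., 0::2], out[..., 1::2] = lo, hi
    return out

def quantized_matvec(packed, scales, x):
    """Dequantize-then-multiply: the pattern fused GPU kernels accelerate."""
    W = unpack_int4(packed, 2 * packed.shape[-1]).astype(np.float32)
    return (W * scales[:, None]) @ x        # per-row scales, illustrative
```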
Discussion and Implications
The introduction of GPTQ represents a significant step forward in the post-training quantization of large-scale transformer models. By enabling efficient compression without notable loss in accuracy, GPTQ facilitates broader access to state-of-the-art generative models, paving the way for their deployment in diverse real-world scenarios where computational resources are limited.
Looking forward, GPTQ opens avenues for future research in the compression of generative models, including potential explorations into activation quantization and hardware-accelerated execution of quantized models. Moreover, the successful application of GPTQ to the largest GPT models suggests that similar methodologies could be extended to other transformer architectures, potentially unlocking new efficiencies across a broad array of AI domains.
In summary, GPTQ presents a compelling solution to the challenges of deploying large-scale generative models, offering a balanced trade-off between model size, performance, and computational efficiency. Its capability to execute highly compressed models on constrained hardware without significant performance degradation holds promise for democratizing access to advanced AI technologies.