
GPTQ Quantization for LLM Compression

Updated 13 May 2026
  • GPTQ Quantization is a method for compressing neural network weights by leveraging Hessian-aware, second-order updates to minimize calibration-set deviation.
  • The technique employs one-pass, column-wise quantization with block/group-wise scaling to retain high accuracy even at low bitwidths like INT4 or INT3.
  • Empirical results show that GPTQ achieves near-lossless model performance with significant speedups, enabling scalable deployment of multi-billion parameter LLMs.

GPTQ (post-training quantization for generative pre-trained transformers; Frantar et al., 2022) is a class of highly efficient, second-order methods for compressing LLMs and other neural architectures to low-bitwidth formats without retraining. GPTQ provides near-lossless INT4/INT3 quantization and serves as the de facto “workhorse” for post-training quantization of transformer models at the multi-billion parameter scale. The framework combines Hessian-aware compensation, greedy coordinate updates, and block/group-wise parameterization to achieve state-of-the-art trade-offs in model size, accuracy retention, and inference throughput.

1. Mathematical Formulation and Algorithmic Structure

GPTQ quantization targets each linear weight matrix $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$ independently. The key objective is to minimize the calibration-set response deviation:

$$\min_{\widehat W} \| WX - \widehat W X \|_F^2$$

where $X \in \mathbb{R}^{d_\text{in} \times m}$ is a small matrix of calibration activations. This can be re-expressed using the empirical second-moment (“Hessian”) matrix $H = X X^\top$ as:

$$\| W - \widehat W \|_H^2 = \mathrm{Tr}[(W - \widehat W) H (W - \widehat W)^\top]$$

The quantized weights $\widehat W$ are constrained to a low-bit grid (e.g., INT4 or FP4), with per-group or per-channel scaling. GPTQ performs a one-pass, column-wise quantization: after quantizing each column, it applies a second-order compensation update to the remaining unquantized columns, using the Hessian or its block-wise/diagonal approximation. This yields the closed-form error-minimizing update per coordinate or group, enabling numeric stability and fast GPU kernels (Frantar et al., 2022, Proskurina et al., 2024, Zhang et al., 6 Aug 2025).
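The column-wise sweep with second-order compensation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function names (`gptq_layer`, `quantize_rtn`, `layer_loss`), the symmetric per-row scale, and the damping default `percdamp=0.01` are choices made here for brevity, and no blocking, grouping, or column reordering is applied.

```python
import numpy as np

def quantize_rtn(w, scale, maxq):
    """Round-to-nearest onto a symmetric integer grid, then dequantize."""
    q = np.clip(np.round(w / scale), -maxq - 1, maxq)
    return q * scale

def layer_loss(W, Q, X):
    """Calibration-set response deviation ||WX - QX||_F^2."""
    return float(np.linalg.norm(W @ X - Q @ X) ** 2)

def gptq_layer(W, X, bits=4, percdamp=0.01):
    """Quantize W (d_out x d_in) column by column with error compensation."""
    W = W.astype(np.float64).copy()
    d_out, d_in = W.shape
    maxq = 2 ** (bits - 1) - 1
    # Assumption: one symmetric scale per output row, fixed before the sweep.
    scale = np.abs(W).max(axis=1, keepdims=True) / maxq
    scale[scale == 0] = 1.0

    H = X @ X.T
    H = H + percdamp * np.mean(np.diag(H)) * np.eye(d_in)  # damping
    # Upper-triangular U with H^{-1} = U^T U.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for j in range(d_in):
        Q[:, j] = quantize_rtn(W[:, j], scale[:, 0], maxq)
        err = (W[:, j] - Q[:, j]) / U[j, j]
        # Push the rounding error of column j onto the unquantized columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```

Calling `gptq_layer(W, X)` on a weight matrix and a `d_in x m` calibration matrix returns dequantized weights on the 4-bit grid; `layer_loss(W, Q, X)` evaluates the objective above and agrees with the trace form using $H = XX^\top$.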

2. Theoretical Foundation and Guarantees

GPTQ is theoretically equivalent, in its back-to-front implementation, to Babai's nearest-plane algorithm for the Closest Vector Problem (CVP) on the Hessian-induced lattice. This geometric perspective provides explicit error bounds: under the no-clipping assumption, the quantization error per layer is upper-bounded by a function of the Cholesky diagonals of the Hessian matrix, with tight results under mild empirical conditions (Chen et al., 24 Jul 2025). GPTQ's iterative, Hessian-informed redistribution of rounding residuals (in the style of Optimal Brain Surgeon) is proven to be an efficient $O(d_\text{in}^2)$ approximation to the NP-hard problem of finding the global group-wise minimum.
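To make the role of the Cholesky diagonals concrete, here is a sketch of how such a bound arises for a single row and a uniform grid step. The notation is an assumption introduced here ($w$ is one row of $W$, $q$ its quantized version, $U$ the upper-triangular factor with $H^{-1} = U^\top U$, $s$ the grid step), and the exact statement in Chen et al. may be phrased differently.

```latex
% e_j is the rounding error of coordinate j at the moment it is quantized,
% i.e. after the compensation updates from coordinates 1..j-1 were applied.
\[
  (w - q)\, H\, (w - q)^\top
  \;=\; \sum_{j=1}^{d_\text{in}} \frac{e_j^2}{U_{jj}^2},
  \qquad e_j = w'_j - q_j .
\]
% Without clipping, round-to-nearest gives |e_j| <= s/2, hence
\[
  (w - q)\, H\, (w - q)^\top
  \;\le\; \frac{s^2}{4} \sum_{j=1}^{d_\text{in}} \frac{1}{U_{jj}^2}.
\]
```

Each step of the sweep adds $e_j^2 / U_{jj}^2$ to the loss because $U_{jj}^2$ equals the leading diagonal entry of the inverse Hessian restricted to the not-yet-quantized coordinates, which is the quantity that appears in the Optimal Brain Surgeon update.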

For stochastic/convex variants, high-probability $\ell_\infty$ error bounds scale as $\mathcal{O}(\sqrt{\log N})$ in the width $N$, providing precise guidance for alphabet size and downstream softmax stability (Zhang et al., 6 Aug 2025). Reordering feature columns by descending norm is mathematically justified to reduce error magnitude. The regularization (damping) strength should be set to a small multiple of the mean Hessian diagonal, balancing error amplification and stability.
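Both recommendations are cheap to apply before the quantization sweep. The sketch below is illustrative only: the function name `damp_and_order` and the default multiple `percdamp=0.01` are assumptions made here, not values taken from the cited analysis. It uses the fact that the $i$-th diagonal entry of $H = XX^\top$ is the squared norm of feature $i$ over the calibration set.

```python
import numpy as np

def damp_and_order(X, percdamp=0.01):
    """Return a damped, reordered Hessian plus the permutation used.

    X is the (d_in x m) calibration matrix. percdamp is an assumed
    small multiple of the mean Hessian diagonal.
    """
    H = X @ X.T
    lam = percdamp * np.mean(np.diag(H))          # damping strength
    H_damped = H + lam * np.eye(H.shape[0])
    # diag(H)[i] = ||X[i, :]||^2, so this sorts features by descending norm.
    perm = np.argsort(-np.diag(H))
    H_perm = H_damped[np.ix_(perm, perm)]
    return H_perm, perm, lam
```

The same permutation has to be applied to the columns of $W$ before quantization and inverted afterwards (for example with `np.argsort(perm)`). The damping term also keeps the Cholesky factorization well defined when there are fewer calibration samples than input features.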

3. Practical Implementation and Extensions

GPTQ is implemented as layerwise, block-wise, or group-wise quantization, usually with group sizes of 32–128 for LLMs (Proskurina et al., 2024, Sander et al., 14 Jan 2026). Bitwidths of 3–4 retain near-baseline perplexity and accuracy: for OPT-175B, 4-bit GPTQ results in PPL increases of <0.2 (FP16=8.34, GPTQ-4b=8.37), with end-to-end speedups of 3–4.5x on modern GPUs (Frantar et al., 2022).
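The effect of the group size on both accuracy and storage can be seen in a small example. The following is a plain round-to-nearest group-wise quantizer (no Hessian compensation) with asymmetric min-max scaling; the function names and the metadata widths (`scale_bits`, `zero_bits`) are assumptions chosen for illustration.

```python
import numpy as np

def quantize_groupwise(W, bits=4, group_size=128):
    """Asymmetric min-max quantization with one scale/zero-point per group."""
    d_out, d_in = W.shape
    assert d_in % group_size == 0, "d_in must be a multiple of group_size"
    maxq = 2 ** bits - 1
    Wg = W.reshape(d_out, d_in // group_size, group_size)
    wmin = Wg.min(axis=2, keepdims=True)
    wmax = Wg.max(axis=2, keepdims=True)
    scale = (wmax - wmin) / maxq
    scale[scale == 0] = 1.0                       # constant groups
    zero = np.round(-wmin / scale)
    codes = np.clip(np.round(Wg / scale) + zero, 0, maxq)
    W_hat = (codes - zero) * scale                # dequantized weights
    return (codes.astype(np.uint8).reshape(d_out, d_in),
            W_hat.reshape(d_out, d_in), scale, zero)

def bits_per_weight(bits, group_size, scale_bits=16, zero_bits=0):
    """Effective storage per weight, including per-group metadata."""
    return bits + (scale_bits + zero_bits) / group_size
```

Under these assumptions, 4-bit codes with a 16-bit scale and a 4-bit zero-point per group of 128 cost `bits_per_weight(4, 128, 16, 4)`, which is 4.15625 bits per weight; shrinking the group to 32 raises the overhead to 4.625 bits but gives each scale fewer weights to cover.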

Computation is dominated by calculation and inversion (or Cholesky factorization) of the Hessian per block. GPU kernels fuse quantization and dequantization, supporting on-the-fly matrix–vector products for inference. GPTQ is composable with pre-quantization transformations (outlier-smoothing via optimized channelwise scaling and/or orthogonal rotation (Liu et al., 23 Jul 2025)), and can be hybridized with low-rank compensation (Liu et al., 23 Jul 2025), discrepancy-minimizing rounding (DiscQuant (Chee et al., 11 Jan 2025)), or post-hoc channel-aware bitwidth allocation (mixed-precision GPTQ (Yvinec et al., 2023)).
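What a fused dequantize-and-multiply kernel computes can be mimicked in NumPy. This is a functional stand-in, not a GPU kernel: the packing layout (two 4-bit codes per byte, low nibble first) and the function names are assumptions made here, and real kernels use their own layouts.

```python
import numpy as np

def pack_int4(codes):
    """Pack pairs of 4-bit codes (values 0..15) along the last axis into uint8."""
    codes = codes.astype(np.uint8)
    return (codes[..., 0::2] | (codes[..., 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Inverse of pack_int4: low nibble first, then high nibble."""
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.uint8)
    out[..., 0::2] = packed & 0x0F
    out[..., 1::2] = packed >> 4
    return out

def dequant_matvec(packed, scale, zero, x, group_size):
    """Compute W_hat @ x, dequantizing the packed weights on the fly.

    scale and zero have shape (d_out, d_in // group_size).
    """
    codes = unpack_int4(packed).astype(np.float64)
    d_out, d_in = codes.shape
    g = codes.reshape(d_out, d_in // group_size, group_size)
    W_hat = ((g - zero[..., None]) * scale[..., None]).reshape(d_out, d_in)
    return W_hat @ x
```

The packed matrix occupies half a byte per weight in memory; the full-precision `W_hat` exists only transiently inside the call, which is the property a fused kernel exploits to reduce memory traffic.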

Recent extensions of GPTQ include:

  • Group-wise scale optimization: A two-stage framework first calibrates per-group scales using input statistics, then post-GPTQ coordinate-descent refinement further reduces global layerwise loss (Kim et al., 2 Feb 2026).
  • Asymmetric calibration (GPTAQ): Ensures each quantized layer directly matches the full-precision output, limiting cumulative drift; provided in a closed-form extension of GPTQ (Li et al., 3 Apr 2025, Li et al., 9 Apr 2026).
  • Fairness constraints (Fair-GPTQ): Introduces group-bias regularization terms to reduce the amplification of social stereotypes during quantization, adding a first-order de-biasing step that is efficiently implemented by an extra Hessian–gradient update (Proskurina et al., 18 Sep 2025).
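As a rough illustration of why post-hoc scale refinement can lower the layerwise loss, consider a single row with fixed integer codes: the Hessian-weighted loss is a one-dimensional quadratic in the scale and has a closed-form minimizer. This is an elementary building block written here for illustration, not the procedure of Kim et al.; the function names are hypothetical.

```python
import numpy as np

def h_loss(w, w_hat, H):
    """Hessian-weighted error (w - w_hat) H (w - w_hat)^T for one row."""
    d = w - w_hat
    return float(d @ H @ d)

def refine_scale(w, q, H):
    """Scale s minimizing h_loss(w, s * q, H) for fixed integer codes q.

    Setting the derivative of (w - s q) H (w - s q)^T to zero gives
    s = (q H w) / (q H q).
    """
    denom = float(q @ H @ q)
    if denom <= 0.0:
        raise ValueError("codes are all zero or H is not positive definite")
    return float(q @ H @ w) / denom
```

Because the refined scale is the exact minimizer for the given codes, it can never be worse than the min-max scale it replaces under this loss; alternating such scale updates with re-rounding is one way a coordinate-descent refinement could be organized.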

4. Applications, Hardware Impact, and Empirical Results

GPTQ is central to production LLM inference for memory-constrained and commodity hardware, as demonstrated in vLLM, HuggingFace Transformers, and various custom datacenter and edge pipelines (Frantar et al., 2022, Sander et al., 14 Jan 2026). Its memory and compute efficiency enable deployment of models in the 175B–405B parameter range on single high-memory GPUs (e.g., an 80GB A100 or a Blackwell B200) (Li et al., 3 Apr 2025).
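A back-of-the-envelope estimate shows how bitwidth determines whether a model's weights fit on one device. The helper below counts weights only (no KV cache, activations, or runtime buffers); the function name and the 16-bit per-group metadata default are assumptions made here.

```python
def weight_memory_gb(n_params, bits, group_size=None, meta_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, meta_bits of per-group metadata are
    amortized over each group of weights.
    """
    bits_per_weight = bits + (meta_bits / group_size if group_size else 0.0)
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(175e9, 16)   # 350.0
int4_gb = weight_memory_gb(175e9, 4)    # 87.5
int3_gb = weight_memory_gb(175e9, 3)    # 65.625
```

By this estimate, a 175B-parameter model needs about 350 GB in FP16, 87.5 GB at 4 bits, and roughly 65.6 GB at 3 bits, so whether it fits in 80 GB depends on the chosen bitwidth and on the grouping overhead.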

Empirical results consistently demonstrate that 4-bit and 3-bit GPTQ quantization of the Llama, OPT, Qwen, and PaLM2 model families stays within 1 perplexity point, or within 1–2% accuracy, of the FP16 baselines across a range of open benchmarks (WikiText-2, C4, MMLU, ARC, HellaSwag, GSM8k, GPT4Eval, etc.) (Frantar et al., 2022, Proskurina et al., 2024, Zhang et al., 6 Aug 2025, Chee et al., 11 Jan 2025). Under aggressive 2-bit quantization, the loss can be mitigated using group-wise optimization or hybrid low-rank correction (Kim et al., 2 Feb 2026, Liu et al., 23 Jul 2025). Recent work demonstrates that integrating GPTQ with advanced outlier-robust activation pre-processing or fine-tuned model distillation pipelines further reduces end-task accuracy loss (Sander et al., 14 Jan 2026).

When adapted to new FP4 “microscaling” hardware formats (MXFP4, NVFP4), GPTQ provides a baseline for post-training quantization; format-specialized rotations and scale grid selection in MR-GPTQ match or exceed existing INT4 benchmarks, nearly saturating hardware throughput (Egiazarian et al., 27 Sep 2025).
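For readers unfamiliar with microscaling formats, the sketch below rounds one block of values to an MXFP4-style representation: FP4 (E2M1) elements sharing a power-of-two scale. It is a simplified illustration and not MR-GPTQ; the function name is hypothetical, and the scale rule used here (the smallest power of two that avoids clipping) is an assumption, since real implementations may derive the shared exponent differently.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Round a block (e.g. 32 values) to FP4 magnitudes times a shared scale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumed rule: smallest power-of-two scale with amax / scale <= 6.
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    out = []
    for v in block:
        # Nearest grid magnitude; ties resolve to the smaller magnitude.
        mag = min(FP4_E2M1, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag, v) * scale)
    return scale, out
```

Unlike an INT4 grid, the representable values are not evenly spaced (steps of 0.5 near zero, steps of 2 near the maximum), which is one reason format-specific rotations and scale selection matter for these formats.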

5. Limitations and Ongoing Developments

While GPTQ’s Hessian-aware updates prevent catastrophic accuracy loss even at very low bitwidths, several limitations persist:

  • Greedy, layer-by-layer quantization can accumulate activation drift, requiring either global blockwise approaches (DiscQuant (Chee et al., 11 Jan 2025)) or layer output realignment (GPTAQ (Li et al., 3 Apr 2025, Li et al., 9 Apr 2026)).
  • Residual bias and societal stereotypes may be differentially amplified by quantization; explicit fairness regularizers (Fair-GPTQ (Proskurina et al., 18 Sep 2025)) or statistical calibration frameworks can partially address these issues, but may incur non-negligible accuracy–bias trade-offs.
  • Hardware-specific formats (FP4, groupwise INT4) require specialized optimization—general GPTQ pipelines are not always optimal for new number representation schemes (Egiazarian et al., 27 Sep 2025).
  • Overheads for very large models—particularly memory for Hessian storage and runtime for group/block quantization—require chunking, factorization, or hybridization with low-rank/coordinate descent methods (Nair et al., 2024).
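The Hessian-storage overhead and the chunking remedy can be made concrete with a short sketch. The helper names are hypothetical and the batch layout (`d_in x m_b` per chunk) is an assumption; the point is that $H$ can be accumulated as a running average so that only one chunk of activations is resident at a time.

```python
import numpy as np

def hessian_bytes(d_in, bytes_per_entry=4):
    """Memory for a dense d_in x d_in Hessian (FP32 by default)."""
    return d_in * d_in * bytes_per_entry

def accumulate_hessian(batches, d_in):
    """Streaming estimate of (1/n) * X @ X.T over calibration chunks."""
    H = np.zeros((d_in, d_in), dtype=np.float64)
    n = 0
    for Xb in batches:                 # Xb has shape (d_in, m_b)
        m = Xb.shape[1]
        H *= n / (n + m)               # rescale the running average
        H += (Xb @ Xb.T) / (n + m)
        n += m
    return H, n
```

For an input width of 12288, `hessian_bytes(12288)` is 603,979,776 bytes (about 0.6 GB) per layer in FP32. Averaging instead of summing changes $H$ only by a constant factor, which does not affect the compensation updates when the damping term is set relative to the mean diagonal.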

GPTQ formalizes and generalizes Optimal Brain Surgeon/Optimal Brain Quantizer, and is closely related to blockwise LDLQ, AdaRound, and BRECQ (Liu et al., 23 Jul 2025, Frantar et al., 2022). The connection to Babai’s nearest-plane algorithm positions GPTQ within lattice-based approximation theory, granting it geometric interpretability and facilitating error analysis (Chen et al., 24 Jul 2025). Recent works provide non-asymptotic error guarantees for both deterministic and stochastic variants, directly informing best practices for coordinate order, regularization, and alphabet size (Zhang et al., 6 Aug 2025).

DiscQuant represents a theoretical advancement, applying discrepancy minimization for globally optimal rounding across the network, often improving on GPTQ at the lowest bitwidths (Chee et al., 11 Jan 2025). Qronos extends GPTQ by jointly quantizing activations and weights with additional error-projection steps, empirically improving over pure GPTQ and providing even tighter theoretical error bounds (Zhang et al., 6 Aug 2025).

7. Prospects and Future Directions

The GPTQ framework continues to be actively extended along multiple axes:

  • Activation quantization: Integrating activation quantization with Hessian-aware scaling, and harmonizing with mixed-precision schemes.
  • Format-specialized kernels: Further exploiting hardware FP4/FP8/INT4 for ultra-fast inference, merging matrix rotations, scaling, and quantization directly into the compute kernel (Egiazarian et al., 27 Sep 2025).
  • Fairness–accuracy trade-offs: Developing fairness-bounded quantization variants, channel-level bias analysis, and integration with debiasing methods in the training pipeline (Proskurina et al., 18 Sep 2025).
  • Data-efficient and robust calibration: Employing OOD, adversarial, and distilled calibration sets without loss of quantization efficacy (Yvinec et al., 2023, Sander et al., 14 Jan 2026).
  • Scalable, blockwise/global quantization: Moving beyond greedy single-layer updates to global, discrepancy-aware, or coordinate-descent–based rounding for minimal error accumulation (Chee et al., 11 Jan 2025, Nair et al., 2024, Kim et al., 2 Feb 2026).

In summary, GPTQ remains the most widely adopted and theoretically principled post-training quantization algorithm for LLMs, achieving a rare combination of scalability, efficiency, and precision at production scale (Frantar et al., 2022, Chen et al., 24 Jul 2025, Zhang et al., 6 Aug 2025, Nair et al., 2024, Li et al., 3 Apr 2025, Chee et al., 11 Jan 2025, Egiazarian et al., 27 Sep 2025, Proskurina et al., 18 Sep 2025).
