
MatGPTQ: Post-Training Multi-Precision Quantization

Updated 10 February 2026
  • The paper introduces a one-shot PTQ algorithm that jointly optimizes nested bit-widths with cross-bit error compensation to preserve accuracy across varied precisions.
  • MatGPTQ employs MSB slicing to extract lower-precision models at runtime, enabling flexible deployment on devices with different memory and latency constraints.
  • The approach integrates budget-aware per-layer bit-width assignment and efficient CUDA kernels, resulting in significant throughput and latency improvements during inference.

Post-Training Matryoshka Quantization (MatGPTQ) is a post-training quantization (PTQ) framework designed to enable accurate and efficient multi-precision deployment of LLMs from a single quantized checkpoint. Leveraging the Matryoshka Quantization (MatQuant) principle, MatGPTQ permits runtime slicing of most-significant bits (MSBs) to extract models at multiple bit-widths (e.g., int8, int4, or int3), meeting varying device memory and latency constraints without retraining or storing separate models. Unlike previous Matryoshka quantization schemes, which require quantization-aware training (QAT), MatGPTQ uses a one-shot PTQ pipeline to jointly optimize a parent model for several target precisions, integrates a budget-aware per-layer bit-width search, and introduces highly efficient device-level inference kernels (Kleinegger et al., 3 Feb 2026).

1. Matryoshka Quantization Framework

Traditional PTQ and QAT approaches generate a single quantized model at one fixed bit-width, e.g., int8 or int4. Matryoshka Quantization (MatQuant) introduces the concept of nesting quantization levels within a single parent checkpoint: a c-bit quantized model encodes r-bit sub-models in its top r MSBs. At inference, the slicing operator S(q^c, r) extracts a lower-precision r-bit model from the parent c-bit code. This nesting allows for bit-width interpolation and runtime selection of the quantization level, supporting diverse device and workload constraints from a single binary (Nair et al., 10 Feb 2025, Kleinegger et al., 3 Feb 2026).

However, optimizing model weights so that all nested bit-widths retain acceptable accuracy is substantially more complex. Each low-precision branch's quantization error must be compensated in the higher-precision branches during optimization, rendering standard PTQ ineffective for Matryoshka quantization.

2. Multi-Precision Quantization Objective

MatGPTQ formalizes multi-precision quantization as a constrained joint minimization problem. For model layer ℓ, let W_\ell \in \mathbb{R}^{d_{row} \times d_{col}} be the FP32 weights, Q^c_\ell the c-bit quantization, and R = \{r_1, ..., r_K\} \subset \{2, ..., c\} the set of target nested precisions.

Bit-Slicing Operator

Given a c-bit code q^c, slicing to r bits is defined as:

S(q^c, r) = \mathrm{clamp}(\lfloor q^c / 2^{c - r} \rceil,\ 0,\ 2^r - 1) \cdot 2^{c - r}

This preserves the top r MSBs, yielding a valid r-bit representation within the c-bit code (Kleinegger et al., 3 Feb 2026).
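Concretely, the operator rounds away the low bits, clamps to the r-bit range, and shifts back into the c-bit grid. A minimal NumPy sketch (illustrative, not the paper's kernel code):

```python
import numpy as np

def slice_bits(q_c: np.ndarray, c: int, r: int) -> np.ndarray:
    """Slice a c-bit integer code down to its top r MSBs.

    Rounds (rather than truncates) the discarded low bits, clamps to
    the r-bit range, and shifts back so the result remains a valid
    c-bit code.
    """
    scaled = np.floor(q_c / 2 ** (c - r) + 0.5)   # round-to-nearest
    clamped = np.clip(scaled, 0, 2 ** r - 1)
    return (clamped * 2 ** (c - r)).astype(q_c.dtype)

# Slicing an 8-bit code to 4 bits keeps only the top 4 bits (after rounding):
q8 = np.array([0, 7, 8, 200, 255], dtype=np.int32)
print(slice_bits(q8, c=8, r=4).tolist())   # [0, 0, 16, 208, 240]
```

Note that slicing with r = c is the identity, and the clamp handles the overflow case where rounding up would leave the r-bit range (e.g., code 255 sliced to 4 bits maps to 240, not 256).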

Joint Reconstruction Loss

MatGPTQ's layerwise quantization minimizes the weighted average of per-precision dequantization errors, using calibration activations X_\ell \in \mathbb{R}^{d_{col} \times N}:

Q_\ell^{c*} = \underset{Q_\ell^c \in \{0, ..., 2^c - 1\}^{d_{row} \times d_{col}}}{\arg\min} \ \sum_{r \in R} \lambda_r \, \| S(Q_\ell^c, r)\, X_\ell - W_\ell X_\ell \|_F^2

where \lambda_r are user-controlled precision importance weights.
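Read directly, the objective is a loop over target precisions, each contributing a weighted activation-space reconstruction error. A minimal sketch, assuming a single per-tensor scale and zero point (the paper's actual quantization granularity is finer):

```python
import numpy as np

def slice_bits(q_c, c, r):
    # Keep the top r MSBs of a c-bit code: round, clamp, shift back.
    return np.clip(np.floor(q_c / 2 ** (c - r) + 0.5), 0, 2 ** r - 1) * 2 ** (c - r)

def multi_precision_loss(Q_c, W, X, c, targets, lambdas, scale, zero):
    """sum_r lambda_r * ||dequant(S(Q^c, r)) X - W X||_F^2 for one layer.

    A single per-tensor (scale, zero) pair is a simplifying assumption;
    real pipelines use per-group scales.
    """
    ref = W @ X
    total = 0.0
    for r, lam in zip(targets, lambdas):
        W_r = slice_bits(Q_c, c, r) * scale + zero   # dequantized sliced weights
        total += lam * np.linalg.norm(W_r @ X - ref, "fro") ** 2
    return total

# Round-to-nearest 8-bit quantization of random weights:
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
X = rng.standard_normal((32, 64))
zero, scale = W.min(), (W.max() - W.min()) / 255
Q8 = np.round((W - zero) / scale)
```

Evaluating the loss with targets = [8] versus [2] on the same codes makes the accuracy/precision trade-off of slicing directly measurable.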

Cross-Bit Error Compensation

To propagate residual quantization errors during one-shot PTQ, MatGPTQ averages the errors across all target precisions:

E_j = \frac{1}{|R|} \sum_{r \in R} \left[ W_{:,j} - \mathrm{dequant}(S(Q^c_{:,j}, r)) \right]

and applies a GPTQ-style update based on the blockwise Hessian inverse (Kleinegger et al., 3 Feb 2026).

3. One-Shot MatGPTQ PTQ Algorithm

MatGPTQ follows the standard GPTQ workflow but replaces GPTQ's scalar rounding step with joint multi-precision quantization and cross-bit error compensation.

Key Steps

  • Partition the layer weight matrix into blocks (typically 128–256 columns).
  • For each column in a block, solve for the c-bit integer code that minimizes the multi-precision objective via brute-force (GPU vectorized) search, as shown in Algorithm 2 of (Kleinegger et al., 3 Feb 2026).
  • After quantizing each weight column, compute the mean error over all r \in R and propagate it via the (precomputed) Hessian inverse as in standard GPTQ.
  • Use a small calibration set (e.g., N = 1024 sequences of length 2048 from Fineweb-Edu, totaling ≈2M tokens) for all activations.

This one-shot process produces a single sliceable checkpoint that encodes multiple precisions at once. For R = \{3, 4, 8\}, the per-layer computational cost is only ≈5× that of single-precision GPTQ, while obviating the need to quantize separately for each target precision.
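The column-wise loop above can be sketched as follows. A per-tensor (scale, zero) pair, equal precision weights, and a weight-space (rather than activation-weighted) code search are simplifying assumptions made here for brevity:

```python
import numpy as np

def slice_bits(q_c, c, r):
    # Keep the top r MSBs of a c-bit code: round, clamp, shift back.
    return np.clip(np.floor(q_c / 2 ** (c - r) + 0.5), 0, 2 ** r - 1) * 2 ** (c - r)

def matgptq_block(W, H_inv, c, targets, scale, zero):
    """Column-wise multi-precision quantization of one weight block.

    For each column, brute-force the c-bit code minimizing the summed
    squared error over all target precisions, then propagate the mean
    residual onto the not-yet-quantized columns via the inverse-Hessian
    row, as in GPTQ.
    """
    W = W.astype(np.float64).copy()
    d_row, d_col = W.shape
    Q = np.zeros((d_row, d_col), dtype=np.int64)
    levels = np.arange(2 ** c, dtype=np.float64)
    # deq[k, q]: dequantized value of code q sliced to targets[k] bits.
    deq = np.stack([slice_bits(levels, c, r) * scale + zero for r in targets])
    for j in range(d_col):
        w = W[:, j]
        # Summed squared error for every (row, candidate code) pair.
        err = ((deq[None, :, :] - w[:, None, None]) ** 2).sum(axis=1)
        Q[:, j] = np.argmin(err, axis=1)
        # Cross-bit compensation: mean residual over target precisions,
        # pushed onto the remaining columns (GPTQ-style update).
        e = w - deq[:, Q[:, j]].mean(axis=0)
        if j + 1 < d_col:
            W[:, j + 1:] -= np.outer(e / H_inv[j, j], H_inv[j, j + 1:])
    return Q, W
```

With a single target precision equal to c, the search degenerates to ordinary round-to-nearest GPTQ; adding lower precisions biases each code toward values that also slice well.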

4. Budget-Aware Per-Layer Bit-Width Assignment

To optimize memory and latency under a global bit budget, MatGPTQ integrates an EvoPress-based bit allocation scheme (Sieberling et al., 2025). This evolutionary search determines heterogeneous per-layer bit-widths b_1, ..., b_L under a global budget constraint \sum_\ell b_\ell \leq B_{tot}:

  • Begin with uniform bit-widths (all b_\ell = c).
  • For multiple generations, randomly mutate by reducing one layer’s bit-width and increasing another's (level-switch), keeping the sum constant.
  • Evaluate in multiple stages: score candidates roughly on a small calibration subset, select the top configurations, then evaluate them on the full calibration set.
  • Select the configuration minimizing perplexity (PPL) or mean squared error (MSE) to update the parent solution.

This allows fine-grained control of average quantization bits per parameter, and empirical results show such non-uniform assignments are Pareto-superior to any uniform assignment at the same effective bit-width (Kleinegger et al., 3 Feb 2026).
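A toy version of the level-switch search, with integer bit-widths, single-stage evaluation, and a caller-supplied fitness function standing in for calibration perplexity (all simplifications relative to the paper's pipeline):

```python
import random

def level_switch_search(n_layers, fitness, start=4, r_min=2, r_max=8,
                        generations=2000, seed=0):
    """Evolutionary per-layer bit-width search under a fixed total budget.

    Each mutation is a "level switch": one layer loses a bit, another
    gains one, so the total budget sum(bits) never changes. `fitness`
    scores a configuration (lower is better); the paper's multi-stage
    filtering is collapsed to a single evaluation per candidate.
    """
    rng = random.Random(seed)
    parent = [start] * n_layers            # uniform start at the budget average
    best = fitness(parent)
    for _ in range(generations):
        i, j = rng.sample(range(n_layers), 2)
        child = parent[:]
        child[i] -= 1                      # level switch keeps the sum constant
        child[j] += 1
        if child[i] < r_min or child[j] > r_max:
            continue
        score = fitness(child)
        if score < best:                   # greedy selection
            parent, best = child, score
    return parent, best

# Toy fitness: per-layer error shrinks 4x per extra bit, scaled by sensitivity.
sens = [10.0, 1.0, 1.0, 0.1]
fit = lambda bits: sum(s * 4.0 ** -b for s, b in zip(sens, bits))
bits, _ = level_switch_search(4, fit, start=4, seed=1)
print(bits)   # the most sensitive layer ends up with the most bits
```

Because every accepted mutation conserves the bit sum, the search explores only configurations at exactly the target average bit-width.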

5. Device-Level Implementation and Inference Kernels

MatGPTQ delivers production-grade CUDA kernels enabling on-the-fly slicing and mixed-precision inference:

  • The storage layout packs c-bit codes into three buffers (64-bit for two base bits and two 32-bit masks for extra bits), supporting 2–4 bits/weight without additional memory.
  • At runtime, weights are unpacked and dequantized to FP16 in registers using specialized instructions.
  • For batch size ≥8, a TensorCore path reshapes weights (offline) into mma.m16n8k16 format to maximize throughput, loading tiles for accumulation in FP32.
  • For smaller batches, a SIMT (single-instruction, multiple-thread) fallback executes FP16×FP16×FP32 GEMM with identical tile sizes.
  • Changing the bit-width r at runtime only requires adjusting the slicing index in S(q, r) and the dequantization scales; no checkpoint reformatting or re-quantization is needed.

Pseudocode for both the quantization and inference kernels is provided in the reference for complete reproducibility.
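As rough intuition for how sub-byte codes are stored and recovered (a toy two-per-byte layout, not the paper's three-buffer scheme with separate base bits and extra-bit masks):

```python
import numpy as np

def pack4(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes two per byte, high nibble first.

    Assumes an even number of codes; a toy stand-in for the kernel's
    packed weight layout.
    """
    codes = codes.astype(np.uint8)
    return (codes[0::2] << 4) | codes[1::2]

def unpack4(packed: np.ndarray) -> np.ndarray:
    # Inverse of pack4: recover both nibbles of every byte with
    # shifts and masks, as a dequantization kernel would in registers.
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

codes = np.array([1, 2, 15, 0], dtype=np.uint8)
print(pack4(codes).tolist())   # [18, 240], i.e. bytes 0x12 and 0xF0
```

The real kernels perform the analogous unpack-shift-mask sequence per tile before dequantizing to FP16 in registers.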

6. Experimental Evaluation and Impact

Benchmarking on LLMs such as LLaMA 3.1 8B (Base/Instr), Qwen3 8B/14B, and Phi-3-Medium across language modeling (Wikitext2 PPL) and zero-shot reasoning tasks (average over ARC-c/e, HellaSwag, PIQA, Winogrande) demonstrates:

  • Uniform Multi-Precision (Average of 6 models):
    • 8-bit: MatGPTQ within 0.65% of GPTQ.
    • 4-bit: within 0.33%.
    • 3-bit: MatGPTQ outperforms GPTQ by +1.34% (due to multi-precision regularization).
  • Interpolation (e.g., slicing c=8 → r=6): within 0.7% of native 6-bit GPTQ accuracy on average.
  • Per-Layer Mix-and-Match (Budget: 3 bits avg):
    • LLaMA 8B-Instr at 2.5 bits avg: MatGPTQ-EP achieves Task Avg 59.30 versus GPTQ-2bit 35.39.
    • At avg 3 bits: MatGPTQ-EP 68.58 vs GPTQ 64.57.
  • Comparison to QAT Matryoshka/OmniQuant: MatGPTQ-PTQ surpasses OmniQuant+MatQuant at {3,4,8} bits by 0.2–0.7 points Task Avg on Gemma2 9B and Mistral 7B in FFN-only quantization.
  • Latency and Throughput: GEMM speedups of 3×–5.6× over torch.matmul at 2–4 bits. For LLaMA 8B-Instr (RTX A6000):
    • FP16: 23.53 ms (42.5 tok/s)
    • 4-bit: 9.13 ms (109 tok/s), speedup 2.58×
    • 3-bit: 8.03 ms (124 tok/s), speedup 2.93×
    • 2-bit: 7.24 ms (138 tok/s), speedup 3.25×
    • vLLM integration provides 1.5–3.5× gains for realistic decode workloads (Kleinegger et al., 3 Feb 2026).

7. Relation to the Matryoshka Quantization Paradigm

The Matryoshka Quantization scheme (Nair et al., 10 Feb 2025) was originally introduced to facilitate storage and runtime flexibility by encoding multiple bit-widths in a single checkpoint via MSB slicing, using QAT (and optionally distillation) for consistency across branches. Whereas the original MatQuant required costly retraining and lacked device-level support, MatGPTQ extends this paradigm by providing an open, efficient, PTQ-based solution with multi-precision support, automated budget-aware optimization, and performant kernels for practical deployment.

In summary, MatGPTQ establishes a new standard for post-training Matryoshka-style quantization by delivering a one-shot, multi-precision PTQ approach that (i) retains or improves upon native-precision GPTQ accuracy, (ii) minimizes storage and deployment complexity, and (iii) enables practical, high-throughput, multi-precision inference from a single checkpoint (Kleinegger et al., 3 Feb 2026).
