Ultra-Low-Bit Post-Training Quantization

Updated 17 March 2026
  • Ultra-low-bit PTQ is a quantization technique that compresses neural network parameters to 2–4 bits, significantly reducing memory footprint without retraining.
  • It employs methods such as block-wise reconstruction and adaptive per-weight rounding to minimize information loss and maintain key performance metrics like perplexity and accuracy.
  • Frameworks like TesseraQ integrate dequantization-scale tuning and progressive rounding to achieve robust ultra-low-bit quantization, setting new performance benchmarks.

Ultra-low-bit Post-Training Quantization (PTQ) reduces the numerical precision of parameters and activations in pretrained deep neural networks, particularly LLMs, to ultra-low bit-widths (as low as 2–4 bits) after training has completed. The goal is to decrease memory footprint and increase inference throughput while keeping degradation in key metrics such as perplexity and downstream accuracy negligible. The TesseraQ framework exemplifies state-of-the-art methodology in this regime by combining block-wise reconstruction, adaptive per-weight rounding, and dequantization-scale optimization for robust ultra-low-bit quantization in LLMs (Li et al., 2024).

1. Fundamental Principles of Ultra-Low-Bit PTQ

Ultra-low-bit PTQ seeks maximal model compression without retraining. Uniform affine quantization is typically employed, converting a real-valued tensor $W\in\mathbb R^n$ to an $N$-bit representation via

$$W^q = \mathrm{clamp}\Bigl(\bigl\lfloor W/s \bigr\rceil + z,\,0,\,2^N-1\Bigr)$$

$$s = \frac{\gamma\,\max(W) - \beta\,\min(W)}{2^N-1},\qquad z = -\bigl\lfloor\beta\,\min(W)/s\bigr\rceil$$

$$\widehat W = s\bigl(W^q - z\bigr)$$

where $s$ is the scale, $z$ is the zero point, $\lfloor\cdot\rceil$ denotes rounding to the nearest integer, and $\gamma$ and $\beta$ are clipping factors applied to the maximum and minimum of $W$.
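The three formulas can be traced on a tiny tensor. The following is a minimal sketch in plain Python, not a reference implementation: the function names `quantize` and `dequantize` are illustrative, the tensor is a flat list, and it assumes `max(w) > min(w)` so that the scale is positive.

```python
def quantize(w, n_bits, gamma=1.0, beta=1.0):
    """Uniform affine quantization of a list of floats.

    Returns (codes, scale, zero_point). Python's round() rounds exact
    halves to the nearest even integer, which may differ from the
    tie-breaking rule of a given framework.
    """
    qmax = 2 ** n_bits - 1
    scale = (gamma * max(w) - beta * min(w)) / qmax
    zero_point = -round(beta * min(w) / scale)
    codes = [min(max(round(x / scale) + zero_point, 0), qmax) for x in w]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to real values: s * (q - z)."""
    return [scale * (q - zero_point) for q in codes]


w = [-0.5, -0.1, 0.2, 1.0]
codes, scale, zero_point = quantize(w, n_bits=2)
print(codes, scale, zero_point)               # [0, 1, 1, 3] 0.5 1
print(dequantize(codes, scale, zero_point))   # [-0.5, 0.0, 0.0, 1.0]
```

At 2 bits only four levels are available, so the distinct weights -0.1 and 0.2 collapse onto the same level. This is the information loss that the rest of the article is concerned with.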

The central challenge in ultra-low-bit quantization is severe information loss: as bit-width falls below 4, large transformer-based models can suffer substantial accuracy degradation. PTQ approaches such as AWQ and OmniQuant optimize per-layer or per-block scaling and clipping parameters to minimize quantization error. GPTQ instead applies Hessian-based corrections to individual weights, but does so under a layer-wise loss, which limits the granularity of its objective.

TesseraQ advances this landscape by (a) introducing a block-wise forward reconstruction objective, (b) parameterizing and optimizing per-weight “rounding” decisions, and (c) refining dequantization scaling on top of existing PTQ schemes (Li et al., 2024).

2. Block-Wise Reconstruction and Loss Formulation

TesseraQ partitions the model into transformer blocks. Quantization error is minimized not at the layer or parameter level alone, but by comparing the forward outputs from each block before and after quantization. The objective is

$$\min_{\epsilon}\, \bigl\|\mathrm{block}(\theta + \epsilon,\,X_b) - \mathrm{block}(\theta,\,X_b)\bigr\|_F^2$$

$$\approx \sum_{b=1}^B \bigl\|\mathrm{block}(\widehat\theta^{(b)},X_b) - \mathrm{block}(\theta^{(b)},X_b)\bigr\|_F^2$$

where the calibration set $\{X_b\}$ contains representative samples for each block.

This finer granularity accounts for intra-block dependencies such as residual and self-attention interactions, providing a tighter surrogate to end-to-end task loss and improving quantization fidelity compared to layer-wise approaches.
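As a rough illustration of the block-level objective, the sketch below compares the outputs of a toy block under full-precision and quantized weights. The block `toy_block` (a residual connection around a ReLU-activated linear map) is a hypothetical stand-in for a transformer block, and all names are assumptions made for this example.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def toy_block(W, x):
    """Hypothetical stand-in for a transformer block: y = x + relu(W x)."""
    return [x_i + max(0.0, h_i) for x_i, h_i in zip(x, matvec(W, x))]


def reconstruction_loss(W_fp, W_q, calib):
    """Squared Frobenius norm of the output difference over calibration inputs."""
    total = 0.0
    for x in calib:
        y_fp = toy_block(W_fp, x)
        y_q = toy_block(W_q, x)
        total += sum((a - b) ** 2 for a, b in zip(y_q, y_fp))
    return total


W_fp = [[0.5, -0.25], [0.0, 1.0]]
W_q = [[0.5, 0.0], [0.0, 1.0]]     # -0.25 was rounded to 0.0
calib = [[1.0, 2.0], [2.0, 0.0]]
print(reconstruction_loss(W_fp, W_q, calib))   # 0.25
```

The second calibration input has a zero in the position multiplied by the perturbed weight, so it contributes nothing to the loss. An output-based objective weights each rounding error by how much it actually changes the block output on the calibration data, which a purely weight-space error would not capture.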

3. Adaptive Rounding and Progressive Hardening

Direct, discrete optimization of per-weight rounding ($\alpha\in\{0,1\}^d$) is computationally intractable. TesseraQ relaxes this problem by introducing continuous variables $\nu\in\mathbb{R}^d$, one per weight, and defining

$$\alpha_i = \begin{cases} \sigma(\nu_i) = \dfrac{1}{1+e^{-\nu_i}} & i \in \mathcal{S}_\mathrm{Soft} \\ \mathbf{1}_{\nu_i>0} & i \in \mathcal{S}_\mathrm{Hard} \end{cases}$$

with $\mathcal{S}_\mathrm{Soft} \cup \mathcal{S}_\mathrm{Hard} = \{1,\dots,d\}$ and $\mathcal{S}_\mathrm{Soft} \cap \mathcal{S}_\mathrm{Hard} = \emptyset$.

Progressive Adaptive Rounding (PAR) proceeds in $K$ rounds. In each round, a hardening score $HS(\nu_i) = |\sigma(\nu_i) - 0.5|$ measures how far each rounding variable sits from the decision boundary at 0.5. The $P_k\%$ of variables with the lowest scores, which are still undecided, remain soft; the rest, already close to 0 or 1, are hardened. This progressive transition from soft (differentiable) to hard (discrete) rounding stabilizes optimization. After $K$ rounds, all weights are deterministically rounded ($\alpha_i\in\{0,1\}$).
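One hardening step can be sketched as follows. This is an illustrative reading rather than the reference implementation: it assumes that the variables with the lowest score (closest to 0.5) stay soft while those with the highest score are hardened, and the helper names `split_soft_hard` and `rounding_values` are made up for the example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def harden_score(v):
    """Distance of the soft rounding value from the 0.5 decision boundary."""
    return abs(sigmoid(v) - 0.5)


def split_soft_hard(nu, soft_percent):
    """Keep the soft_percent% lowest-score indices soft; harden the rest."""
    order = sorted(range(len(nu)), key=lambda i: harden_score(nu[i]))
    n_soft = int(len(nu) * soft_percent / 100)
    return set(order[:n_soft]), set(order[n_soft:])


def rounding_values(nu, soft):
    """alpha_i = sigmoid(nu_i) for soft indices, 1[nu_i > 0] for hard ones."""
    return [sigmoid(v) if i in soft else float(v > 0) for i, v in enumerate(nu)]


nu = [-4.0, -0.1, 0.3, 5.0]
soft, hard = split_soft_hard(nu, soft_percent=50)
print(sorted(soft), sorted(hard))   # [1, 2] [0, 3]
print(rounding_values(nu, soft))
```

Under this reading, hardening changes the rounding value of the selected weights very little (for example, a value near 0.98 becomes 1.0), which is one plausible reason the schedule keeps the reconstruction loss stable between rounds.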

Simultaneously, a dequantization-scale parameter $\delta\in\mathbb R$ is introduced:

$$\widehat\theta = 2\,\sigma(\delta)\, s\, (\theta^q - z),\quad \sigma(\delta)\in(0,1)$$

$\delta$ is updated by gradient descent on the same block-reconstruction loss, with small weight decay for stability.
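Because $\sigma(\delta)\in(0,1)$, the multiplier $2\,\sigma(\delta)$ lies in $(0,2)$ and equals exactly 1 at $\delta=0$, so starting from $\delta=0$ leaves the original scale unchanged. A small sketch, with an illustrative function name and a single scalar $\delta$ shared by all codes (the granularity of $\delta$ is an assumption here):

```python
import math


def dequantize_with_scale_tuning(codes, scale, zero_point, delta=0.0):
    """Dequantize with the tunable multiplier 2 * sigmoid(delta)."""
    factor = 2.0 / (1.0 + math.exp(-delta))   # 2 * sigmoid(delta), in (0, 2)
    return [factor * scale * (q - zero_point) for q in codes]


codes, scale, zero_point = [0, 1, 3], 0.5, 1
print(dequantize_with_scale_tuning(codes, scale, zero_point))             # [-0.5, 0.0, 1.0]
print(dequantize_with_scale_tuning(codes, scale, zero_point, delta=0.1))  # slightly stretched
```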

4. Calibration Algorithm and Integration with Pre-Existing PTQ Methods

The calibration process operates per block and takes as inputs the pretrained FP16 model, a small calibration set, and PTQ scales/zero-points (e.g., from AWQ or OmniQuant). Initialization sets $\nu$ so that the soft-rounded weights reproduce the FP16 weights. Calibration then alternates Adam-driven minimization (with respect to $\nu$ and $\delta$) with hardening of rounding variables at each PAR round.
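The alternation between optimization and hardening can be sketched on a toy problem. The example below is a heavily simplified illustration under several stated assumptions: the "block" is a single dot product $y = w\cdot x$; plain gradient descent replaces Adam; $\delta$ is omitted; soft weights are modeled as $s(\lfloor w/s\rfloor + \alpha)$, an AdaRound-style form in which the zero point cancels and clamping is ignored; and the soft share shrinks linearly to zero, a schedule chosen only for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def calibrate(w, x_calib, scale, rounds=4, steps=50, lr=0.1):
    """Toy progressive-rounding loop for y = w . x; returns integer codes."""
    base = [math.floor(w_i / scale) for w_i in w]
    # Initialize nu so that sigmoid(nu) equals the fractional part of w / s,
    # i.e. the soft-rounded weights start at the original weights.
    frac = [min(max(w_i / scale - b, 1e-4), 1 - 1e-4) for w_i, b in zip(w, base)]
    nu = [math.log(f / (1 - f)) for f in frac]
    soft = set(range(len(w)))

    def alpha(i):
        return sigmoid(nu[i]) if i in soft else float(nu[i] > 0)

    for k in range(1, rounds + 1):
        for _ in range(steps):
            w_hat = [scale * (b + alpha(i)) for i, b in enumerate(base)]
            grad = [0.0] * len(w)
            for x in x_calib:
                err = sum((wh - w_i) * x_i for wh, w_i, x_i in zip(w_hat, w, x))
                for i in soft:
                    sg = sigmoid(nu[i])
                    grad[i] += 2 * err * x[i] * scale * sg * (1 - sg)
            for i in soft:
                nu[i] -= lr * grad[i]
        # Hardening: keep a shrinking share of the lowest-score variables soft.
        n_soft = int(len(w) * (1 - k / rounds))
        order = sorted(soft, key=lambda i: abs(sigmoid(nu[i]) - 0.5))
        soft = set(order[:n_soft])
    return [b + int(alpha(i)) for i, b in enumerate(base)]


w = [0.26, -0.31, 0.74]
x_calib = [[1.0, 0.5, -1.0], [0.2, 1.0, 0.3]]
print(calibrate(w, x_calib, scale=0.5))
```

After the final round the soft set is empty, so every returned code is either $\lfloor w_i/s\rfloor$ or $\lfloor w_i/s\rfloor + 1$. Which of the two is chosen depends on the calibration inputs, not only on the weight's distance to the nearest level.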

Key hyperparameters are:

  • PAR rounds $K=20$, steps per round $T=250$
  • Learning rate $1 \times 10^{-3}$, weight decay $10^{-4}$ on $\delta$
  • Calibration batch size: 4, total samples: 512
  • Group size: 64 or 128 for per-group quantization
  • Weight bit-width: 2–4 bits; activation bit-width: 3–16 bits
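These settings can be collected in a small configuration object, which also makes the implied optimization budget explicit. The class and field names below are hypothetical, and the derived quantities assume that the $K \times T$ steps are spent on each block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationConfig:
    """Hypothetical container for the hyperparameters listed above."""
    par_rounds: int = 20              # K
    steps_per_round: int = 250        # T
    learning_rate: float = 1e-3
    delta_weight_decay: float = 1e-4
    batch_size: int = 4
    num_samples: int = 512
    group_size: int = 128             # 64 or 128
    weight_bits: int = 2              # 2 to 4
    act_bits: int = 16                # 3 to 16

    @property
    def total_steps(self) -> int:
        """Optimizer steps per block (assuming K * T steps per block)."""
        return self.par_rounds * self.steps_per_round

    @property
    def passes_over_calibration_set(self) -> float:
        """How many times each calibration sample is visited per block."""
        return self.total_steps * self.batch_size / self.num_samples


cfg = CalibrationConfig()
print(cfg.total_steps)                    # 5000
print(cfg.passes_over_calibration_set)    # 39.0625
```

Under that assumption, each block sees the 512-sample calibration set roughly 39 times, which helps explain why calibration takes hours rather than minutes.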

TesseraQ is designed as a plug-in on top of existing PTQ methods. For AWQ, apply TesseraQ’s block-wise rounding and dequantization-scale tuning after AWQ’s scale/clip optimization. For OmniQuant, freeze the block-wise scale/clip results and then apply TesseraQ. No modifications to the underlying AWQ or OmniQuant implementations are required beyond exporting $s$, $z$, and activations (Li et al., 2024).

5. Empirical Results in Ultra-Low-Bit Quantization

In the reported evaluations, TesseraQ matches or outperforms the listed baselines in nearly all schemes and benchmarks. Representative results for LLaMA-2-7B and LLaMA-3.1-8B are summarized below.

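Table A. WikiText2 Perplexity (LLaMA-2-7B, weight-only quantization, group size 128)

| Bitwidth  | Method    | PPL   |
|-----------|-----------|-------|
| W2A16g128 | AWQ       | 14.65 |
| W2A16g128 | OmniQuant | 11.06 |
| W2A16g128 | TesseraQ  | 6.82  |
| W3A16g128 | AWQ       | 6.19  |
| W3A16g128 | OmniQuant | 6.03  |
| W3A16g128 | TesseraQ  | 5.71  |
| W4A16g128 | AWQ       | 5.82  |
| W4A16g128 | OmniQuant | 5.74  |
| W4A16g128 | TesseraQ  | 5.56  |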
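To relate labels such as W2A16g128 to storage, the sketch below estimates the effective bits per weight under per-group quantization. It rests on assumptions not stated in the source: each group of weights stores one 16-bit scale and one zero point at the weight bit-width, the model has a nominal 7 billion quantized parameters, and any layers kept in higher precision are ignored.

```python
def bits_per_weight(weight_bits, group_size, scale_bits=16, zero_point_bits=None):
    """Effective bits per weight, including per-group scale and zero point."""
    if zero_point_bits is None:
        zero_point_bits = weight_bits   # assumption: zero point stored at weight precision
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def weight_storage_gb(num_params, weight_bits, group_size):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight(weight_bits, group_size) / 8 / 1e9


NUM_PARAMS = 7e9   # nominal parameter count, for illustration only
for bits in (2, 3, 4):
    print(bits, bits_per_weight(bits, 128), weight_storage_gb(NUM_PARAMS, bits, 128))
print("FP16", NUM_PARAMS * 16 / 8 / 1e9)   # 14.0
```

With these assumptions, 2-bit weights at group size 128 cost about 2.14 bits each, or roughly 1.87 GB for the nominal model, compared with 14 GB in FP16.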

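Table B. Average Zero-Shot Accuracy (%) on 5 Tasks (LLaMA-2-7B)

| Bitwidth  | Method    | Avg. Acc. |
|-----------|-----------|-----------|
| W2A16g128 | AWQ       | 50.52     |
| W2A16g128 | OmniQuant | 47.59     |
| W2A16g128 | SignRound | 55.92     |
| W2A16g128 | TesseraQ  | 59.27     |
| W3A16g128 | AWQ       | 62.87     |
| W3A16g128 | OmniQuant | 62.66     |
| W3A16g128 | SignRound | 63.72     |
| W3A16g128 | TesseraQ  | 63.59     |
| W4A16g128 | AWQ       | 63.65     |
| W4A16g128 | TesseraQ  | 64.19     |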

Table C. Weight-Activation Quantization W4A4 (LLaMA-2-7B)

| Method   | WT2 PPL ↓ | C4 PPL ↓ | Avg Acc ↑ |
|----------|-----------|----------|-----------|
| AWQ      | 24.30     | 30.39    | 53.65     |
| TesseraQ | 10.45     | 12.77    | 60.48     |

Additionally, combining TesseraQ with rotation-based schemes such as QuaRot on LLaMA-3.1-8B (W3A3 quantization) yields:

  • QuaRot+GPTQ: Avg accuracy = 62.87
  • QuaRot+TesseraQ: Avg accuracy = 65.12

The consistent improvements suggest TesseraQ’s methods substantially reduce quantization-induced degradation even at 2–3 bits (Li et al., 2024).

6. Practical Considerations, Limitations, and Prospective Extensions

Strengths: TesseraQ’s per-weight adaptive rounding increases expressivity, allowing finer control over quantization error compared to layer-wide methods. Block-reconstruction captures cross-layer residuals and attention effects. The progressive hardening schedule improves optimization stability and convergence. Dequantization-scale tuning offers an additional compensatory mechanism against systematic rounding biases.

Limitations: TesseraQ calibration requires 3–6 hours (on a single A100 GPU, 65 GB) per model; reducing the number of PAR rounds or calibration set size expedites calibration but trades off accuracy. Hyperparameters such as PAR schedule and learning rate influence results but exhibit moderate sensitivity.

Future Directions: Integration with non-uniform (rotation-based, codebook-based) quantizers may further improve outlier robustness at ultra-low bit-widths. Reducing calibration data requirements via distillation or data-free approaches is an open avenue. Automated mixed-precision selection and hardware-aware INT2 inference kernel implementations remain as prospective research directions (Li et al., 2024).

7. Summary and Impact

Ultra-low-bit PTQ, as exemplified by TesseraQ, enables deployment of LLMs on highly resource-constrained platforms with minimal performance degradation. The combination of block-wise reconstruction loss, adaptive per-weight rounding, and dequantization-scale tuning advances the state of the art in ultra-low-bit quantization on standard tasks and models without retraining the original model. TesseraQ’s design is modular and compatible with a wide range of existing quantization pipelines, broadening its applicability in practical and research settings (Li et al., 2024).
