Low-Bit Post-Training Quantization
- Low-Bit PTQ is a model compression technique that converts full-precision neural networks to low-bit representations using small calibration sets without further fine-tuning.
- It leverages optimization strategies such as coordinate descent, progressive reconstruction, and flatness regularization to minimize functional distortion and accuracy loss.
- The approach is applied to vision, speech, and language models, achieving near-lossless performance and enabling efficient deployment on resource-constrained hardware.
Low-Bit Post-Training Quantization (PTQ) is a model compression paradigm wherein a pretrained neural network, typically trained in full (e.g., float32) precision, is converted into a functionally equivalent model with low-precision (e.g., 2–4 bit) weights and/or activations using only a small calibration set, and without any further supervised fine-tuning. The goal is to obtain highly efficient inference with minimal accuracy loss, enabling deployment on memory‐ and compute‐constrained hardware. This article surveys foundational methodologies, algorithmic innovations, representational formulations, and empirical results that define the state of the art in low-bit PTQ across modern neural architectures, with a particular focus on approaches recently advanced for vision, speech, and LLMs.
1. Quantization Formulations and Core Algorithms
Classic PTQ strategies for low bit-widths (≤4 bits) use a uniform quantizer parameterized by a per-tensor or per-channel scaling factor $s$ and zero-point $z$, mapping a real-valued tensor $x$ to $\hat{x} = s\,(q - z)$, where $q$ indexes codewords from a symmetric or asymmetric quantization grid. The major challenge is to select the quantizer parameters such that the functional distortion at the network output is minimized, particularly as quantization error is amplified at low bit-widths.
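As a minimal sketch of such a uniform quantizer, the NumPy snippet below implements per-tensor asymmetric quantization with a scale and zero-point derived from the tensor's min/max range. The function names and the min/max calibration rule are illustrative assumptions, not taken from any of the cited methods.

```python
import numpy as np

def affine_quantize(x, n_bits=4):
    """Asymmetric uniform quantization with a per-tensor scale and zero-point."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a constant tensor, which would otherwise give a zero scale.
    scale = max(x_max - x_min, 1e-8) / (qmax - qmin)
    zero_point = int(np.clip(round(-x_min / scale), qmin, qmax))
    # Map reals to integer codes on the grid [qmin, qmax].
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map integer codes back to the real line."""
    return scale * (q.astype(np.float64) - zero_point)

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
q, scale, zp = affine_quantize(x, n_bits=4)
x_hat = dequantize(q, scale, zp)
max_err = np.abs(x - x_hat).max()
print(f"scale={scale:.4f} zero_point={zp} max_abs_error={max_err:.4f}")
```

With min/max calibration the worst-case rounding error is about half a quantization step, which is why the step size (and hence the error) grows quickly as the bit-width drops.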
Coordinate Descent and Reconstruction-Based PTQ
COMQ exemplifies a coordinate-wise minimization approach, casting PTQ as minimizing the layer-wise MSE between quantized and full-precision outputs under integer quantization. Each weight is decomposed as $w = s \cdot q$, with $q$ constrained to the integer codebook, and the global error is minimized alternately over the (shared or channel-wise) scale $s$ and the bit-codes $q$. The update is a greedy coordinate descent involving only dot products and rounding. It achieves near-lossless accuracy (<1% Top-1 drop) in 4-bit ViTs and CNNs, outperforming AdaRound, BRECQ, and similar methods, especially in ultra-low-bit regimes (Zhang et al., 2024). The method is hyperparameter-free and computationally efficient, requiring only 3–4 iterations per layer.
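The alternating scheme described above can be sketched for a single weight column as follows. This is an illustrative approximation under stated assumptions, not the COMQ reference implementation: the round-to-nearest initialization, the fixed coordinate order, and the name `coordinate_descent_quantize` are all choices made here for brevity.

```python
import numpy as np

def coordinate_descent_quantize(X, w, n_bits=4, n_iters=3):
    """Quantize one weight column w (d,) given calibration inputs X (n, d)."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    scale = max(np.abs(w).max(), 1e-12) / qmax      # assumed initial scale
    q = np.clip(np.round(w / scale), qmin, qmax)    # round-to-nearest start
    target = X @ w                                  # full-precision layer output
    col_norms = (X ** 2).sum(axis=0) + 1e-12
    for _ in range(n_iters):
        residual = target - scale * (X @ q)
        for i in range(w.size):
            # Remove coordinate i's contribution, then re-fit it on its own.
            partial = residual + scale * q[i] * X[:, i]
            best = np.round(X[:, i] @ partial / (scale * col_norms[i]))
            best = np.clip(best, qmin, qmax)
            residual = partial - scale * best * X[:, i]
            q[i] = best
        # Closed-form least-squares update of the shared scale.
        Xq = X @ q
        denom = Xq @ Xq
        if denom > 0:
            scale = (target @ Xq) / denom
    return q.astype(np.int32), scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 32))
w = rng.normal(size=32)
q, scale = coordinate_descent_quantize(X, w, n_bits=3)

# Baseline: plain round-to-nearest with the same initial scale.
rtn_scale = np.abs(w).max() / 3
w_rtn = np.clip(np.round(w / rtn_scale), -4, 3) * rtn_scale
err_rtn = np.sum((X @ w - X @ w_rtn) ** 2)
err_cd = np.sum((X @ w - scale * (X @ q)) ** 2)
print(f"output MSE: round-to-nearest={err_rtn:.3f} coordinate descent={err_cd:.3f}")
```

Because each coordinate update and each scale update solves its one-dimensional subproblem exactly, the layer-output error cannot increase relative to the round-to-nearest starting point.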
Progressive and Multi-Granularity Reconstruction
Progressive Fine-to-Coarse Reconstruction (PFCR) extends blockwise and layerwise PTQ to multi-level granularities, starting from module-wise units (e.g., ViT MHSA or MLP+shortcut), and iteratively forming coarser units via a hierarchical, fine-to-coarse reconstruction schedule. At each level $l$, the quantization parameters are refined to minimize the reconstruction error between quantized and full-precision outputs over all reconstruction units at that level. Optimization proceeds progressively, with a Smooth → Rugged staged schedule to address instabilities of joint weight–activation quantization at low bits. PFCR achieves 75.61% Top-1 for 3-bit ViT-B on ImageNet, exceeding all prior PTQ baselines by large margins, and generalizes to detection/segmentation backbones (Ding et al., 2024).
Flatness-Regularized and Noise Modeling Preconditioning
Quantization loss is known to scale with the curvature of the loss landscape at the pretrained solution. Methods such as DNQ model weight and activation quantization errors as independent Gaussian noise and inject these during fine-tuning to drive convergence toward flatter basins. Differential weight noise (for WQE) and stochastic dropout-based activation noise (for AQE) are ramped up, followed by SWA, so the model is robust to the anticipated PTQ errors. This approach reduces quantization-induced accuracy drops at 2–4 bits compared to all post-hoc-only methods, as validated in head-to-head ablations against AdaRound, BRECQ, and QDrop (Xia et al., 3 Nov 2025).
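The dependence on curvature can be made explicit with a standard second-order Taylor argument. This is a textbook derivation, not a formula taken from the cited work. The notation is assumed here: $w^*$ denotes the pretrained weights, $\Delta w$ the quantization perturbation, and $g$ and $H$ the gradient and Hessian of the loss $\mathcal{L}$ at $w^*$.

```latex
\mathcal{L}(w^* + \Delta w) - \mathcal{L}(w^*)
  \;\approx\; g^\top \Delta w + \tfrac{1}{2}\, \Delta w^\top H \, \Delta w
  \;\le\; g^\top \Delta w + \tfrac{1}{2}\, \lambda_{\max}(H)\, \lVert \Delta w \rVert_2^2
```

At a well-converged solution $g \approx 0$, so the loss increase is governed by $\tfrac{1}{2}\lambda_{\max}(H)\lVert \Delta w \rVert_2^2$: for the same quantization error, a flatter basin (smaller $\lambda_{\max}$) yields a smaller accuracy drop.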
2. Representation Strategies for Ultra-Low Bitwidth
Mixed-Precision and Masking
PTQ1.61 demonstrates a structured mixed-precision assignment: a lightweight 1D binary mask (negligible overhead) selects a fixed fraction of channels (e.g., the top 20% by per-channel input activation norm) to be quantized at higher precision (4 bits), while the remaining channels are strictly binarized. Non-salient (binarized) channels incorporate learnable scaling vectors, optimized via a compound L2/angular blockwise loss, while salient 4-bit channels are quantized uniformly. This yields an effective rate of 1.61 bits/weight for LLMs. An additional LoRA-based preprocessing step aligns weight saliency for more precise channel assignment. Across LLaMA and OPT variants, PTQ1.61 achieves a perplexity of 12.50 (LLaMA-7B, WikiText2, 1.61 bits/weight), outperforming PB-LLM and BiLLM by wide margins. An ablation confirms that the mask, the scales, and the LoRA preprocessing are each necessary in this regime (Zhao et al., 18 Feb 2025).
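The structure of such a channel mask, and the bit-rate arithmetic behind it, can be sketched as below. This is a simplified illustration, not the PTQ1.61 algorithm. It assumes a closed-form binarization scale (the mean absolute value, which is L2-optimal for sign binarization) in place of the learned scaling vectors, and it omits the LoRA preprocessing and the mask storage overhead.

```python
import numpy as np

def mixed_precision_quantize(W, act_norm, salient_frac=0.2, hi_bits=4):
    """W: (out, in) weights; act_norm: (in,) per-input-channel activation norms."""
    n_in = W.shape[1]
    n_salient = max(1, int(round(salient_frac * n_in)))
    mask = np.zeros(n_in, dtype=bool)               # 1D binary mask over input channels
    mask[np.argsort(act_norm)[-n_salient:]] = True  # largest activation norms are salient

    W_q = np.empty_like(W)
    # Salient channels: symmetric uniform quantization at hi_bits, one scale per channel.
    qmax = 2 ** (hi_bits - 1) - 1
    Ws = W[:, mask]
    scale = np.maximum(np.abs(Ws).max(axis=0, keepdims=True), 1e-12) / qmax
    W_q[:, mask] = np.clip(np.round(Ws / scale), -qmax - 1, qmax) * scale
    # Remaining channels: sign binarization with a per-channel scale.
    Wb = W[:, ~mask]
    alpha = np.abs(Wb).mean(axis=0, keepdims=True)
    W_q[:, ~mask] = alpha * np.sign(Wb)

    avg_bits = (hi_bits * n_salient + 1 * (n_in - n_salient)) / n_in
    return W_q, mask, avg_bits

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 100))
act_norm = rng.uniform(size=100)
W_q, mask, avg_bits = mixed_precision_quantize(W, act_norm)
print(f"salient channels={mask.sum()} average bits/weight={avg_bits:.2f}")
```

With 20% of channels at 4 bits and 80% at 1 bit, the weights alone average 0.2 × 4 + 0.8 × 1 = 1.6 bits; the sketch does not model the small remaining overhead that brings the reported figure to 1.61.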
Graph-Based Mixed-Precision Assignment
MG-PTQ leverages a GNN to encode column/block dependencies (using Cholesky factors of the Hessian inverse as adjacency matrices) and outputs per-column bit-width assignments by running a Gumbel-Softmax estimator through a small classifier. This formulation integrates sensitivity and dependency structure for assigning bit allocations, resulting in significantly reduced perplexity at low average bit-width (e.g., 2.0) compared to GPTQ or AWQ (Liu et al., 30 Jan 2025).
Ternary Decomposition and Series Expansion
PTQTP decomposes each weight matrix into two trit-planes (each with entries in $\{-1, 0, 1\}$) and optimizes their scalings by alternating ridge regression and discrete search. It reaches 3.16 bits/weight while requiring only addition and subtraction at inference, combining the expressiveness of ternary quantization with the inference speed of binary quantization. Direct comparisons show, for example, 82.4% mathematical reasoning retention on Math-500, far exceeding all prior methods below 2 bits (Xiao et al., 21 Sep 2025).
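The alternation between a discrete search over trit values and a ridge-regression update of the scales can be sketched on a single weight vector. This is a toy illustration under simplifying assumptions (one pair of scales for the whole vector, exhaustive search over the nine trit pairs per element, an arbitrary initialization), not the PTQTP implementation.

```python
import itertools
import numpy as np

def trit_plane_decompose(w, n_iters=10, ridge=1e-8):
    """Approximate w (1-D) as a1*T1 + a2*T2 with T1, T2 in {-1, 0, 1}."""
    # All 9 possible (t1, t2) pairs for one element.
    combos = np.array(list(itertools.product((-1, 0, 1), repeat=2)), dtype=float)
    m = np.abs(w).mean()
    alpha = np.array([m, m / 2])                    # assumed initial scales
    errors = []
    for _ in range(n_iters):
        # Discrete step: per element, pick the trit pair closest to w.
        candidates = combos @ alpha                 # (9,) representable values
        idx = np.abs(w[:, None] - candidates[None, :]).argmin(axis=1)
        T = combos[idx]                             # (n, 2) trit-planes
        # Continuous step: ridge regression for the two plane scales.
        gram = T.T @ T + ridge * np.eye(2)
        alpha = np.linalg.solve(gram, T.T @ w)
        errors.append(float(np.sum((w - T @ alpha) ** 2)))
    return T.astype(np.int8), alpha, errors

rng = np.random.default_rng(0)
w = rng.normal(size=512)
T, alpha, errors = trit_plane_decompose(w)
# Reference point: single-plane sign binarization with its L2-optimal scale.
err_binary = float(np.sum((w - np.abs(w).mean() * np.sign(w)) ** 2))
print(f"binary error={err_binary:.2f} two trit-planes error={errors[-1]:.2f}")
```

Since every reconstructed value is a signed sum of at most two scales, a matrix-vector product with such weights needs only additions and subtractions followed by two multiplications per output.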
FP = xINT recasts the entire model as a truncated series of integer-quantized “basis” networks whose outputs are summed. This allows base models to be low-precision (2–4 bit), with summation handled via custom Abelian group operations (Editor’s term) to maintain exactness at inference. Empirically, only a small number of basis terms (T≈3–4) are needed for convergence to FP performance; a 4-bit quantized ResNet-50 can even surpass the original accuracy (Zhang et al., 2024).
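To give a feel for why a few low-bit terms can suffice, the sketch below expands a single tensor into a sum of low-bit terms by repeatedly quantizing the residual. This tensor-level residual scheme is an assumption made here for illustration; the cited work formulates the expansion at the level of whole networks, and its exact construction is not reproduced.

```python
import numpy as np

def series_expand(w, n_bits=2, n_terms=4):
    """Return a list of (integer codes, scale) whose scaled sum approximates w."""
    qmax = 2 ** (n_bits - 1) - 1
    terms = []
    residual = w.astype(np.float64).copy()
    for _ in range(n_terms):
        scale = max(np.abs(residual).max(), 1e-12) / qmax
        q = np.clip(np.round(residual / scale), -qmax - 1, qmax)
        terms.append((q.astype(np.int32), scale))
        residual = residual - scale * q             # next term models what is left
    return terms

def reconstruct(terms):
    """Sum the scaled integer terms back into a real-valued tensor."""
    return sum(scale * q for q, scale in terms)

rng = np.random.default_rng(0)
w = rng.normal(size=256)
terms = series_expand(w, n_bits=2, n_terms=4)
max_err = np.abs(w - reconstruct(terms)).max()
print(f"max abs error after {len(terms)} terms: {max_err:.4f}")
```

With 2-bit terms and max-based scaling, each term at least halves the largest residual, so the error decays geometrically in the number of terms.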
3. Robustness, Loss Landscape, and Optimization Techniques
Hessian-Aware Conditioning
HeRo-Q tackles the "low-error, high-loss" paradox in LLM PTQ by explicitly reducing the largest eigenvalue of the Hessian via a learnable rotation-compression transformation. This approach applies diagonal smoothing and orthogonal rotations prior to quantization, minimizing downstream sensitivity according to the spectral surrogate bound. This yields superior stability in extreme cases (e.g., W3A16) where prior methods (GPTQ, AWQ, SpinQuant) collapse, with empirical gains of 2–8 percentage points in zero-shot accuracy (Zhang et al., 29 Jan 2026).
DASH-Q, recognizing instability in cross-channel Hessian-based PTQ under limited calibration, discards all off-diagonal Hessian terms, retaining only the per-channel diagonal and solving N independent weighted least-squares quantization subproblems. This leads to stable convergence and outperforms all recent baselines by 7.01 points on average at 2 bits in large LLMs, with high calibration robustness and up to 74× lower runtime than blockwise GPTQ (Kim et al., 15 Apr 2026).
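One such diagonal-weighted subproblem can be sketched as an alternation between rounding and a closed-form scale update. This is a minimal illustration under assumptions, not the DASH-Q algorithm: `h_diag` stands in for the retained Hessian diagonal (for a layer-wise squared-error objective, the per-input sums of squared calibration activations), and the max-based initialization is chosen here for simplicity.

```python
import numpy as np

def diag_weighted_quantize(w, h_diag, n_bits=2, n_iters=5):
    """Quantize one channel w (d,) under positive importance weights h_diag (d,)."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    scale = max(np.abs(w).max(), 1e-12) / max(qmax, 1)
    for _ in range(n_iters):
        # Codes given scale: element-wise rounding is optimal for a diagonal weighting.
        q = np.clip(np.round(w / scale), qmin, qmax)
        denom = np.sum(h_diag * q * q)
        if denom == 0:
            break
        # Scale given codes: weighted least-squares solution in closed form.
        scale = np.sum(h_diag * w * q) / denom
    q = np.clip(np.round(w / scale), qmin, qmax)
    return q.astype(np.int32), scale

rng = np.random.default_rng(0)
w = rng.normal(size=64)
h = rng.uniform(0.1, 10.0, size=64)
q, s = diag_weighted_quantize(w, h, n_bits=2)

# Baseline: round-to-nearest with the initial max-based scale.
s0 = np.abs(w).max()
q0 = np.clip(np.round(w / s0), -2, 1)
err0 = np.sum(h * (w - s0 * q0) ** 2)
err = np.sum(h * (w - s * q) ** 2)
print(f"weighted error: initial={err0:.2f} refined={err:.2f}")
```

Because no off-diagonal terms couple the elements, each step has an exact solution and the channels can be processed independently, which is consistent with the runtime advantage reported above.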
MARR generalizes residual-reconstruction methods by introducing module-specific adaptive scaling coefficients (closed-loop PID control) for cross-layer residuals, mitigating the bias-variance trade-off created by the Hessian-approximation. This modular adaptation yields up to 20.2% performance improvement over prior residual methods in LLMs and up to 4.6% relative Top-1 gain in ViTs (Su et al., 18 May 2026).
Joint Weight/Activation Methods and Flatness via Activation-aware PTQ
QDrop addresses catastrophic collapse in low bitwidth PTQ (notably W2A2) by randomly dropping activation quantization during the weight-rounding phase, cultivating a flatter loss basin that generalizes under out-of-distribution quantization noise. Empirically, QDrop achieves up to 51.49% accuracy gain over prior methods in aggressive regimes and consistently reduces Hessian largest eigenvalue and trace, confirming enhanced flatness (Wei et al., 2022).
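The random-dropping idea can be sketched in a few lines: during calibration, each activation element is left in full precision with some probability and quantized otherwise. The helper names, the min/max fake quantizer, and the default drop probability are illustrative assumptions rather than the QDrop reference code.

```python
import numpy as np

def fake_quant(x, n_bits=2):
    """Quantize-dequantize x onto a uniform grid spanning its min/max range."""
    levels = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = max(hi - lo, 1e-12) / levels
    return np.round((x - lo) / scale) * scale + lo

def qdrop_activation(x, n_bits=2, drop_prob=0.5, rng=None):
    """Skip activation quantization element-wise with probability drop_prob."""
    rng = np.random.default_rng() if rng is None else rng
    x_q = fake_quant(x, n_bits)
    keep_fp = rng.random(x.shape) < drop_prob   # True: element stays full precision
    return np.where(keep_fp, x, x_q)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x_q = fake_quant(x, n_bits=2)
mixed = qdrop_activation(x, n_bits=2, drop_prob=0.5, rng=rng)
print(f"fraction left in full precision: {np.mean(mixed == x):.2f}")
```

At inference time all activations are quantized; the random mixing is applied only while the weight rounding is being optimized, so the weights are exposed to a range of activation-noise patterns.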
PD-Quant directly minimizes the end-to-end KL divergence of final network softmax predictions, aligning quantization objectives with inference-time loss. Together with Distribution Correction (DC) to match batch-norm statistics from the full distribution, these strategies allow 1–2 point improvements over the strongest random-drop PTQ at 2-bit settings (Liu et al., 2022).
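A prediction-level objective of this kind can be sketched as the KL divergence between the softmax outputs of the full-precision and quantized networks. The direction of the divergence (full-precision distribution as the reference) and the function names are assumptions made for this illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def prediction_kl(fp_logits, q_logits):
    """Mean KL(p_fp || p_q) over a batch of logits with shape (batch, classes)."""
    log_p = log_softmax(fp_logits)
    log_q = log_softmax(q_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

rng = np.random.default_rng(0)
fp_logits = rng.normal(size=(8, 10))
q_logits = fp_logits + 0.3 * rng.normal(size=(8, 10))   # stand-in for quantization noise
kl_same = prediction_kl(fp_logits, fp_logits)
kl_noisy = prediction_kl(fp_logits, q_logits)
print(f"KL identical={kl_same:.4f} KL perturbed={kl_noisy:.4f}")
```

Unlike a layer-wise MSE, this objective ignores perturbations that leave the predicted distribution unchanged, such as a constant shift of all logits.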
4. Architectural and Task-Specific Advances
Transformers (Vision and LLMs)
Progressive Fine-to-Coarse Reconstruction (PFCR) and approaches such as COMQ (for ViTs) and TesseraQ (for LLMs) have tailored block reconstruction and adaptive rounding schemes, systematically addressing accumulated errors and rounding instabilities in multi-block architectures. TesseraQ, for instance, introduces Progressive Adaptive Rounding (PAR) and tunes per-block dequantization scale within a blockwise reconstruction loss, yielding major perplexity and accuracy improvements on LLaMA models at 2–3 bits (Ding et al., 2024, Zhang et al., 2024, Li et al., 2024).
SignRoundV2 introduces a scalable sensitivity metric (DeltaLoss), driven by a first-order Taylor expansion and gradient-weighted quantization error, which informs mixed-precision allocation and scale pre-tuning in LLM PTQ; it keeps the deviation from full precision to about 1% in production scenarios at 4–5 bits (Cheng et al., 4 Dec 2025).
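A gradient-weighted sensitivity score of this general kind can be sketched as follows. The exact DeltaLoss definition is not given in this article, so the form used here (the sum of absolute gradient-times-quantization-error products under round-to-nearest) is an assumed stand-in based on a first-order Taylor expansion.

```python
import numpy as np

def rtn(w, n_bits):
    """Symmetric round-to-nearest quantize-dequantize with a per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.abs(w).max(), 1e-12) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def delta_loss(w, grad, n_bits):
    """First-order estimate of the loss change caused by quantizing w to n_bits."""
    return float(np.sum(np.abs(grad * (rtn(w, n_bits) - w))))

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
g_small = 0.01 * rng.normal(size=1000)   # hypothetical gradients of an insensitive layer
g_large = 10.0 * g_small                 # same pattern, ten times more sensitive

scores = {
    ("insensitive", 2): delta_loss(w, g_small, 2),
    ("insensitive", 8): delta_loss(w, g_small, 8),
    ("sensitive", 4): delta_loss(w, g_large, 4),
    ("insensitive", 4): delta_loss(w, g_small, 4),
}
for key, value in scores.items():
    print(key, f"{value:.4f}")
```

Under a fixed average bit budget, layers with larger scores would be natural candidates for higher precision, since the same quantization error costs them more loss.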
QLLM targets activation outliers with an explicit channel disassembly/reassembly pipeline and then corrects residuals via lightweight low-rank tuning, drastically reducing quantization error in high-outlier LLMs and delivering state-of-the-art 4-bit quantization for LLaMA-2-70B in under 10 hours on standard hardware (Liu et al., 2023).
Speech, Super-Resolution, and TinyML
2DQuant exemplifies the dual-stage pipeline for SwinIR super-resolution models, combining distribution-optimized bound initialization (for mixed symmetric/asymmetric activations) with a distillation-based calibration for quantizer bounds. This yields up to 1.5 dB PSNR improvement at 2 bits relative to prior PTQ approaches and enables substantial compression and speedup (Liu et al., 2024).
Empirical studies in TinyML demonstrate strong accuracy, memory, and compute trade-offs using a modular PTQ pipeline with per-channel symmetric quantization for weights, asymmetric quantization for activations, reconstruction loss minimization at the block level, and post-hoc bias tuning. At 4W/4A, models incur <5% accuracy loss, with more aggressive quantization requiring the aforementioned advanced methods for meaningful retention (Zhuo et al., 2022).
5. Implementation Guidance and Practical Considerations
Best practice across recent literature suggests the following for practitioners:
- Calibration dataset: Small calibration sets (hundreds to a few thousand unlabeled samples) generally suffice. For transformers and LLMs, a calibration set of 128 sequences of 2048 tokens each is standard.
- Update order: Greedy sorting by activation-weight product magnitude when updating bit codes/channel scale (e.g., in COMQ) reduces error.
- Iterations: For iterative methods, 3–4 passes typically suffice; extra passes bring diminishing marginal gains.
- Mixed-precision: When possible, leverage structured masking or graph-based assignment to allocate higher precision to critical channels/columns within a defined bit-width budget.
- Activation quantization: For end-to-end int inference, combine weight-oriented PTQ with established activation PTQ schemes (e.g., RepQ-ViT or RTN for LLMs).
- Resilience to outliers: Apply pre-processing procedures (Coarse-to-Fine, channel disassembly, DC) to handle distributed outliers robustly, especially in transformer architectures (Ding et al., 2023, Liu et al., 2023).
- Resource: Many state-of-the-art methods execute full quantization on models up to 70B parameters in a single-digit number of GPU-hours, and advanced schemes (e.g., DASH-Q) can reduce runtime further by orders of magnitude (Kim et al., 15 Apr 2026).
6. Limitations, Open Problems, and Future Directions
Despite substantial progress, several open challenges remain:
- Calibration dependency: Most methods require at least a minimal calibration set; data-free or zero-shot PTQ without performance collapse at ≤2 bits remains elusive (Kim et al., 15 Apr 2026).
- Activation quantization: Robust 2–3 bit joint weight–activation PTQ for transformers is not universally solved; advanced methods like QDrop and PFCR yield improvement, but stability relies on carefully tuned optimization schedules and flatness regularizers (Wei et al., 2022, Ding et al., 2024).
- Hessian estimation: Sensitivity and stability are impaired by noisy Hessian estimation with limited data. Diagonal-only approaches (DASH-Q) outperform those relying on off-diagonal compensation in such regimes (Kim et al., 15 Apr 2026).
- Bitwidth flexibility: Dynamic, per-token, or hardware-adaptive bitwidth assignment with negligible accuracy degradation is an active topic, especially for layer/hardware co-design (Cheng et al., 4 Dec 2025, Liu et al., 30 Jan 2025).
- Efficient kernel support: Inference speed gains rely on hardware/ASIC support for non-standard formats (e.g., ternary planes); emulation via higher-precision may hinder realized efficiency, underscoring a need for co-designed hardware (Xiao et al., 21 Sep 2025).
- Quantization of instruction-tuned models: UPQ demonstrates INT2 quantization of instruction-tuned LLMs via blockwise PTQ followed by distillation-based QAT, yielding >85% retention of original reasoning and instruction accuracy—a new benchmark for the field (Lee et al., 10 Jun 2025).
Continued advances in robust loss-aware quantization, quantization-friendly model preconditioning, and hardware-software synergy are anticipated to further expand the capabilities and reach of low-bit PTQ across architectural classes and deployment regimes.