REQuant: CNN Post-Training Quantization
- The paper introduces REQuant, a unified and mathematically grounded framework for CNN post-training quantization that minimizes MSE and preserves accuracy through layer-wise optimization.
- REQuant is a collection of algorithmic techniques including golden-section search, non-uniform codebook construction, and weight-space transformations to enhance quantization precision.
- The method achieves near-zero accuracy loss for deep CNNs on ultra-low bit-width deployments without retraining, outperforming conventional uniform quantization approaches.
REQuant for CNN Post-Training Quantization is a collection of algorithmic frameworks and analytical routines for optimally quantizing convolutional neural networks after training, with a focus on minimizing information loss and preserving task accuracy at ultra-low bit-widths. The umbrella term "REQuant" (Editor’s term) here covers several distinct but related methodologies introduced in the literature from 2018 to 2025, encompassing range estimation via MSE minimization, non-uniform codebook construction, post-quantization sensitivity correction, and re-quantization of pre-quantized models. These methods systematically address the quantization process for weights and activations, integrating mathematical guarantees, closed-form update schemes, convex optimization, and empirical calibration to achieve high efficiency and accuracy—especially in the absence of retraining or extensive labeled examples.
1. Mathematical Formulation and Quantization Models
The core principle underlying REQuant methods is the modeling of post-training quantization as a layer-wise optimization problem. Given a full-precision weight tensor $W \in \mathbb{R}^n$ for a particular layer and a target bit-width $b$, symmetric uniform quantization is defined over a clipping interval $[-c, c]$ with $c \le \max_i |W_i|$. A range-scaling parameter $\alpha \in (0, 1]$ is introduced via $c = \alpha \max_i |W_i|$, yielding a scale

$$s(\alpha) = \frac{\alpha \max_i |W_i|}{2^{b-1} - 1},$$

with the quantized and reconstructed weights

$$\hat{W}_i(\alpha) = s(\alpha)\,\operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{W_i}{s(\alpha)}\right),\, -2^{b-1},\, 2^{b-1}-1\right).$$

The mean squared error (MSE) between $W$ and its quantized version $\hat{W}(\alpha)$ per layer is

$$\mathrm{MSE}(\alpha) = \frac{1}{n}\sum_{i=1}^{n}\bigl(W_i - \hat{W}_i(\alpha)\bigr)^2,$$

and the overall network error splits additively across layers. Optimization then proceeds via one-dimensional search for the global minimizer $\alpha^\star$ for each layer independently, leveraging the proven local (and thus global over $\alpha$) convexity of $\mathrm{MSE}(\alpha)$ per region of fixed integer assignments (Yang et al., 5 Oct 2025). This convex structure is fundamental: it guarantees the existence of a unique minimum and enables efficient global search methods.
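To make the per-layer objective concrete, the following minimal sketch implements the quantize–reconstruct–MSE loop under the symmetric scheme above (the function name and NumPy formulation are ours, not the papers'):

```python
import numpy as np

def quantize_mse(w, alpha, bits=4):
    """Symmetric uniform quantization of w over [-c, c], c = alpha * max|w|.

    Returns the per-layer MSE between w and its reconstruction.
    """
    qmax = 2 ** (bits - 1) - 1
    c = alpha * np.abs(w).max()
    scale = c / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # integer codes
    w_hat = q * scale                                  # reconstructed weights
    return np.mean((w - w_hat) ** 2)
```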
2. Golden-Section Range Search and Weight-Space Transformations
To identify the optimal range parameter $\alpha^\star$ minimizing per-layer quantization error, REQuant applies golden-section search over $\alpha \in (0, 1]$ to a prescribed precision $\epsilon$. The computational complexity is $O(\log(1/\epsilon))$ MSE evaluations per layer, with full independence across layers, and convergence is rapid in practice (typically 30–50 iterations per layer).
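A compact golden-section routine over a unimodal scalar objective might look as follows (a sketch with illustrative names; the objective can be the `quantize_mse` helper from the sketch above):

```python
import math

def golden_section_min(f, lo=1e-3, hi=1.0, tol=1e-4):
    """Minimize a unimodal scalar function f over [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi ~ 0.618, interval shrink factor
    a, b = lo, hi
    c, d = b - invphi * (b - a), a + invphi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - invphi * (b - a)
            fc = f(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + invphi * (b - a)
            fd = f(d)
    return (a + b) / 2

# e.g.: alpha_star = golden_section_min(lambda a: quantize_mse(w, a, bits=4))
```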
A further innovation is the application of weight-space transformations prior to quantization. An invertible, sign-preserving reshaping of the weights redistributes them such that the resulting quantization allocates finer resolution near zero, corresponding to the empirical distribution of weights in modern CNNs. After quantizing and then inverting this transformation, the MSE in the original weight space maintains the same convexity in $\alpha$, and the search procedure remains unchanged. Empirically, this combined "clip+reshape" strategy is critical: ablations on ResNet-18 (CIFAR-10, 4-bit) demonstrate an improvement from 89.1% to 94.7% top-1 accuracy (Yang et al., 5 Oct 2025).
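The specific reshaping function is not reproduced here; as an illustrative stand-in with the same qualitative effect (denser codewords near zero), the sketch below uses a hypothetical sign-preserving square-root map $T(w) = \operatorname{sign}(w)\sqrt{|w|}$ with inverse $T^{-1}(v) = \operatorname{sign}(v)\,v^2$:

```python
import numpy as np

def reshape_quantize(w, alpha, bits=4):
    """Quantize in a transformed space, then invert the transform.

    T is a hypothetical sign-preserving sqrt reshaping (illustrative only);
    it spreads small-magnitude weights apart so uniform bins in the
    transformed space become finer near zero in the original space.
    """
    v = np.sign(w) * np.sqrt(np.abs(w))                  # T(w)
    qmax = 2 ** (bits - 1) - 1
    scale = alpha * np.abs(v).max() / qmax
    v_hat = np.clip(np.round(v / scale), -qmax - 1, qmax) * scale
    return np.sign(v_hat) * v_hat ** 2                   # T^{-1}(v_hat)
```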
3. Non-Uniform Quantization via Codebook Optimization
Alternative REQuant formulations employ non-uniform quantizers by constructing layer-specific codebooks to minimize distortion relative to the true underlying distribution of weights. These codebooks are derived by fitting a bell-shaped density $f(w)$ (Gaussian or Laplace, verified via Kolmogorov–Smirnov test) and then deterministically partitioning the optimally clipped weight interval into non-uniform bins $[t_{j-1}, t_j)$, each associated with a reconstruction level $c_j$. The iterative minimization alternates closed-form conditional mean updates for $c_j$ (given the thresholds $t$) and midpoint updates for $t_j$ (given the levels $c$):

$$c_j = \frac{\int_{t_{j-1}}^{t_j} w\, f(w)\, dw}{\int_{t_{j-1}}^{t_j} f(w)\, dw}, \qquad t_j = \frac{c_j + c_{j+1}}{2}.$$

Convergence is rapid (5–10 iterations per layer). This direct minimization of true MSE contrasts with uniform quantization, which allocates bins obliviously with respect to weight density. Empirical studies report consistently lower MSE on ResNet-50 and higher top-1 accuracy for REQuant versus both uniform and other non-uniform approaches (Luqman et al., 2024).
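An empirical, sample-based version of this alternation (a minimal Lloyd-style sketch on the raw weights, rather than the papers' density-integral form) could be written as:

```python
import numpy as np

def fit_codebook(w, bits=4, iters=10):
    """Alternate conditional-mean / midpoint updates (Lloyd-style) on samples w."""
    k = 2 ** bits
    levels = np.linspace(w.min(), w.max(), k)        # initial reconstruction levels
    for _ in range(iters):
        thresholds = (levels[:-1] + levels[1:]) / 2  # midpoint update for t_j
        bins = np.digitize(w, thresholds)            # assign each sample to a bin
        for j in range(k):                           # conditional-mean update for c_j
            members = w[bins == j]
            if members.size:
                levels[j] = members.mean()
    return levels, thresholds
```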
4. Correction, Bit Allocation, and Hybrid Schemes
REQuant pipelines introduced closed-form, analytically-driven threshold and bit allocation schemes, such as Analytical Clipping for Integer Quantization (ACIQ) and per-channel bit allocations that assign each channel $i$ a number of quantization levels $M_i = 2^{b_i}$ in proportion to a power of its value range $R_i$,

$$M_i = M \cdot \frac{R_i^{2/3}}{\sum_j R_j^{2/3}},$$

subject to a total codebook constraint $\sum_i M_i = M$. Bias-correction schemes further ensure the mean and $\ell_2$-norm between quantized and original weights are preserved, via simple recalibrations of the form

$$\tilde{W}_c = \gamma_c\bigl(\hat{W}_c + \delta_c\bigr), \qquad \delta_c = \mu(W_c) - \mu(\hat{W}_c), \qquad \gamma_c = \frac{\lVert W_c \rVert_2}{\lVert \hat{W}_c + \delta_c \rVert_2},$$

per channel $c$. Final weights are re-centered and scaled before deployment. These approaches allow hybrid bit-width assignments across channels and enforce quantization precision precisely where most impactful, with no retraining or full-data access (Banner et al., 2018).
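A sketch of both steps under the formulation above (NumPy; the per-channel layout and function names are our assumptions):

```python
import numpy as np

def allocate_levels(ranges, total_levels):
    """Split a total codebook budget across channels in proportion to R_i^(2/3).

    Rounding means the budget is met only approximately; a real pipeline
    would redistribute the remainder.
    """
    share = ranges ** (2 / 3)
    return np.maximum(2, np.round(total_levels * share / share.sum())).astype(int)

def bias_correct(w, w_hat):
    """Re-center and re-scale rows (channels) so mean and l2-norm match w."""
    delta = w.mean(axis=1, keepdims=True) - w_hat.mean(axis=1, keepdims=True)
    shifted = w_hat + delta
    gamma = (np.linalg.norm(w, axis=1, keepdims=True)
             / np.linalg.norm(shifted, axis=1, keepdims=True))
    return gamma * shifted
```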
5. Re-Quantization and Hardware Adaptivity
Later extensions of REQuant address the deployment of CNNs on diverse fixed-point hardware by enabling rapid re-quantization of already quantized models. This multi-step post-training re-quantization procedure involves: (i) bias correction (re-centering layer outputs based on calibration set mean differences), (ii) weight clipping (eliminating outliers pre-quantization for optimal scale allocation), (iii) weight correction (projecting float weights to new quantizer lattices), and (iv) round-error folding (modifying integer weights/scales to exactly satisfy power-of-2 multiplier constraints). Empirical application to MobileNetV2 demonstrates accuracy retention within 0.4% for symmetric re-quantization and 0.64% for symmetric + power-of-2 scaling (Manohara et al., 2023). This flexible post-training adaptation enables deployment on various accelerators (NPUs, TPUs, DPUs) without fine-tuning or error-prone scale conversions.
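As one concrete ingredient, step (iv) can be illustrated by snapping a floating-point scale to the nearest power of two and folding the residual ratio back into the integer weights (a hedged sketch only; the actual folding in the cited pipeline may differ):

```python
import numpy as np

def fold_to_pow2(w_int, scale):
    """Round a quantizer scale to the nearest power of two and compensate weights.

    Illustrative: the residual ratio is folded into the integer weights by
    re-rounding, keeping w_int * scale ~ w_folded * pow2_scale. In practice
    the result must be re-clipped to the target integer range.
    """
    pow2_scale = 2.0 ** np.round(np.log2(scale))
    w_folded = np.round(w_int * scale / pow2_scale).astype(w_int.dtype)
    return w_folded, pow2_scale
```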
6. Post-Quantization Sensitivity and Loss-Preserving Sparse Correction
Recent work introduces the Post-quantization Integral (PQI) as a rigorous sensitivity metric. PQI computes the integrated gradient of network loss along the path from original to quantized weights,

$$\mathrm{PQI}_i = \left|(\hat{W}_i - W_i)\int_0^1 \frac{\partial \mathcal{L}\bigl(W + t(\hat{W} - W)\bigr)}{\partial W_i}\, dt\right|,$$

which upper-bounds the loss impact of each quantization operation. This sensitivity scoring enables self-adaptive outlier selection (higher PQI entries kept in higher precision) and step-wise significant-weights detach (greedily rescuing weights with maximal predicted error impact). In CNNs such as ResNet-50 on ImageNet, this yields quantized models whose top-1 accuracy at 4 bits improves from 75.1% to 75.8%, with marginal memory and runtime overhead (Hu et al., 28 Feb 2025).
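The path integral can be approximated numerically by a Riemann sum of gradients, sketched here with PyTorch autograd (the closure interface, function name, and step count are our assumptions):

```python
import torch

def pqi_scores(loss_fn, w, w_hat, steps=8):
    """Approximate per-weight PQI: |(w_hat - w) * mean_t dL/dw at w + t(w_hat - w)|.

    loss_fn: callable mapping a weight tensor to a scalar loss.
    """
    delta = w_hat - w
    grad_sum = torch.zeros_like(w)
    for k in range(steps):
        t = (k + 0.5) / steps                              # midpoint rule on [0, 1]
        wt = (w + t * delta).detach().requires_grad_(True)
        loss = loss_fn(wt)
        grad_sum += torch.autograd.grad(loss, wt)[0]
    return (delta * grad_sum / steps).abs()
```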
7. Experimental Evaluation, Limitations, and Best Practices
Across the REQuant family, state-of-the-art post-training quantization performance is reported:
- ResNet-18 (CIFAR-10): 4/4 bit top-1 accuracy: histogram scaling 90.02%, AdaRound 85.01%, OMSE 92.62%, SQuant 93.70%, REQuant 94.67% (Yang et al., 5 Oct 2025)
- Inception-v3 (CIFAR-10): 4/4 bit: histogram 82.84%, AdaRound 68.37%, OMSE 92.21%, SQuant 93.29%, REQuant 89.40% (Yang et al., 5 Oct 2025)
- ResNet-50 (ImageNet): 8 bit: REQuant 72.18% vs. APoT 69.31% vs. ACIQ 67.62% (Luqman et al., 2024); 4 bit: full REQuant pipeline 71.8%, baseline 64.5% (Banner et al., 2018)
REQuant’s key characteristics:
- Requires no fine-tuning or full dataset; calibration is restricted to summary statistics or small unlabeled subsets.
- Achieves near-zero (≤1–2%) accuracy loss down to 4 bits for large vision models; memory and bandwidth reductions are multiplicative in bit-width, with corresponding deployment speed-ups on integer-oriented accelerators.
- Most implementations recommend per-layer or per-channel quantization, regular PQI recomputation during iterative sparse correction, and allocation of outlier/sensitive weights kept at higher precision under a strict memory budget (a small fraction of total parameters).
- Limitations: extremely skewed distributions, small CNNs, or 2-bit quantization may require increased granularity in clipping or sensitivity estimation; first/last layers often remain at higher bit-widths in practical deployments.
8. Conclusion
REQuant methods unify advances in range estimation, non-uniform quantizer construction, closed-form calibration, and post-quantization error correction into a comprehensive toolbox for CNN post-training quantization. They stand out for their mathematically grounded guarantees (convexity of MSE, optimality of codebook updates), empirical effectiveness across modern CNNs, and versatility for both weight and activation quantization. By integrating per-layer optimization, calibration-efficient pipelines, and fine-grained sensitivity corrections, REQuant enables highly efficient compression and hardware portability for deep vision models with minimal loss of predictive performance (Yang et al., 5 Oct 2025, Luqman et al., 2024, Banner et al., 2018, Manohara et al., 2023, Hu et al., 28 Feb 2025).