
Quantization-Aware Approximations

Updated 9 December 2025
  • Quantization-aware approximations are methods that incorporate the quantization operator during training to ensure robust neural network performance even in low-precision regimes.
  • They employ surrogate gradients and constrained optimization techniques to overcome the non-differentiable nature of quantization, reducing bias in model updates.
  • Advanced strategies such as adaptive parameterizations, noise tempering, and hardware co-design enable near-lossless performance at 2–4 bits in various applications.

Quantization-aware approximations comprise algorithms, models, and optimization techniques that explicitly account for quantization effects during training, inference, or model compression. Unlike naive post-training quantization, these methods incorporate the quantization operator—often non-differentiable—within the optimization loop or model design, yielding neural networks that maintain predictive performance under low precision arithmetic. Principal approaches include surrogate or continuous relaxations to enable meaningful gradients, constrained or regularized optimization for quantization, noise-injection techniques, learnable quantizer parameterizations, hybrid fine-tuning, and structured factorization coupled with quantization.

1. Mathematical Foundations and Surrogate Gradient Approaches

Quantization operators (e.g., rounding, clipping) are piecewise constant and non-differentiable, impeding direct gradient-based optimization. The dominant workaround is the Straight-Through Estimator (STE), which approximates the gradient of the quantizer by an identity mapping within the quantizer's non-clipped operating range. However, this introduces systematic bias. Several quantization-aware methods address these STE limitations:

  • Curvature-Aware Gradients: CAGE introduces a curvature-aware correction into the update

$$x_{t+1} = x_t - \alpha\left[\nabla f(x_t) + \lambda(x_t - Q(x_t))\right],$$

where $Q$ is the quantizer and $\lambda$ is a trade-off coefficient. This term explicitly corrects for the loss increase due to quantization, and the scheme provably converges to Pareto-optimal solutions in the non-convex regime (Tabesh et al., 21 Oct 2025).

  • Continuous Relaxations: “Continuous Approximations for Improving Quantization Aware Training of LLMs” replaces hard rounding and clamping with differentiable surrogates (Sigmoid STE, SoftClamp), permitting non-trivial learning of both weights and quantizer parameters. This enables finer adaptation of step sizes and yields lower overall perplexity and error rates (Li et al., 6 Oct 2024); a minimal sketch of the hard-STE/soft-rounding contrast appears after this list.
  • Magnitude-aware and Piecewise-linear Gradients: The Magnitude-Aware Derivative (MAD) keeps gradients nonzero but suppresses them as the magnitude crosses the clipping threshold, unlike standard PWL or STE estimators. This hybrid estimator reduces gradient bias while avoiding variance explosion (Sakr et al., 2022).
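
To make the contrast between hard STE rounding and a continuous relaxation concrete, here is a minimal PyTorch sketch: a straight-through rounding function plus a sigmoid-based soft-rounding surrogate inside a symmetric fake quantizer. The names (`RoundSTE`, `soft_round`, `fake_quant`), the temperature `tau`, and the clipping convention are illustrative assumptions, not the exact formulations of the cited papers.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Hard rounding in the forward pass; the backward pass uses the identity
    surrogate gradient (straight-through estimator)."""
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # pass gradients through unchanged

def soft_round(x, tau=10.0):
    """Differentiable rounding surrogate: a sigmoid ramp between integer levels.
    Larger tau sharpens the ramp; tau -> infinity recovers hard rounding."""
    base = torch.floor(x)
    frac = x - base
    return base + torch.sigmoid(tau * (frac - 0.5))

def fake_quant(w, step, bits=4, round_fn=soft_round):
    """Symmetric fake quantization of w with step size `step` (possibly learnable)."""
    qmax = 2 ** (bits - 1) - 1
    z = torch.clamp(w / step, -qmax - 1, qmax)  # map to the integer grid and clip
    return round_fn(z) * step                   # (soft-)round, then rescale

# Hard-STE variant: fake_quant(w, step, round_fn=RoundSTE.apply)
```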

2. Constrained Optimization and Regularization-based Techniques

Quantization can be cast as a constrained or regularized optimization problem:

  • Strongly Dual Constrained QAT: “Neural Networks with Quantization Constraints” frames QAT as minimizing the loss subject to explicit upper bounds on quantizer-induced errors at each layer and output. The primal-dual algorithm alternates updates of model weights and dual variables, never requiring STE. The dual variables $\lambda_l$ not only guide optimization but also quantify layer sensitivity, naturally enabling mixed-precision allocation (Hounie et al., 2022).
  • Piecewise-Affine Regularized Quantization (PARQ): Minimizes

$$L(\theta) + \lambda R(\theta),$$

where $R(\theta) = \sum_{i} \Psi(\theta_i)$ with $\Psi$ a convex, piecewise-affine penalty promoting clustering at quantizer levels. The aggregate proximal (AProx) method yields “soft to hard” quantization and recovers STE in the $\gamma \to \infty$ limit. Last-iterate convergence is proved under mild conditions (Jin et al., 19 Mar 2025).

  • Noise-based Smoothing and Annealing: Additive Noise Annealing (ANA) applies stochastic smoothing to the quantizer via convolution with a smooth noise kernel, making the expected quantizer output differentiable almost everywhere. Annealing the noise variance drives the network to a hard quantized solution at test time, generalizing STE and yielding universal QNN approximation results (Spallanzani et al., 2019).
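
To illustrate the smoothing idea in the last item, the following sketch gives the closed-form expectation of a rounding quantizer under additive uniform noise, which is piecewise linear and therefore differentiable almost everywhere. ANA itself admits general smooth noise kernels and a more careful annealing analysis; this is only a minimal sketch under a uniform-noise assumption, with an illustrative schedule.

```python
import torch

def smoothed_round(x, noise_width):
    """Closed-form E[round(x + eps)] for eps ~ Uniform(-w/2, w/2) with w <= 1,
    using the round-half-up convention. Piecewise linear in x, so gradients are
    well defined almost everywhere; w -> 0 recovers the hard staircase and
    w = 1 gives exactly the identity map."""
    if noise_width <= 0.0:
        return torch.floor(x + 0.5)   # hard round-half-up
    base = torch.floor(x)
    frac = x - base
    # Probability that the noisy input crosses the half-step threshold.
    p_up = torch.clamp((frac - 0.5) / noise_width + 0.5, 0.0, 1.0)
    return base + p_up

def annealed_width(epoch, total_epochs, w0=1.0):
    """Illustrative linear annealing: the noise width reaches zero at 80% of
    training, after which the quantizer is exactly hard."""
    return max(0.0, w0 * (1.0 - epoch / (0.8 * total_epochs)))
```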

3. Learnable Quantizer Parameterizations and Adaptive Modules

Accurate QAT relies on fitting quantizers to the data distribution, especially in non-symmetric or non-uniform cases:

  • Parameterization Schemes: Asymmetric quantization replaces the fixed scale+offset parameterization with more stable (min, max) or (beta, gamma) parameterizations; the latter accelerates convergence and improves robustness, particularly for large models and higher bit-widths (You et al., 25 Apr 2024).
  • Adaptive Step Size Quantization (ASQ): Introduces micro-networks that emit dynamic, per-layer or per-channel step sizes as functions of the input activation statistics, enabling precise adaptation to layerwise distribution shifts. Together with POST (Power Of Square Root of Two) non-uniform quantization, ASQ achieves or surpasses full-precision baselines, maintaining efficient LUT-based inference (Zhou et al., 24 Apr 2025).
  • Optimal Clipping: OCTAV computes MSE-optimal clipping thresholds for each tensor and iteration via a Newton-Raphson fixed point recursion, while MAD/PWL hybrid gradients allow stable and near-optimal accuracy on challenging tasks (ResNet/ImageNet, BERT/SQuAD) (Sakr et al., 2022).
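
As a concrete reading of “MSE-optimal clipping,” the sketch below finds the clipping threshold minimizing empirical quantization MSE on a calibration tensor by brute-force search. OCTAV obtains the same minimizer far more cheaply via its Newton-Raphson fixed-point recursion; the grid search, quantizer convention, and grid resolution here are only illustrative stand-ins.

```python
import torch

def quant_mse(x, clip, bits=4):
    """Empirical MSE of symmetric uniform quantization of x with threshold `clip`."""
    levels = 2 ** (bits - 1) - 1
    step = clip / levels
    x_q = torch.clamp(torch.round(x / step), -levels, levels) * step
    return torch.mean((x - x_q) ** 2)

def mse_optimal_clip(x, bits=4, n_grid=200):
    """Brute-force stand-in for a Newton-Raphson search: return the clipping
    threshold that minimizes empirical quantization MSE on tensor x."""
    x_max = x.abs().max()
    candidates = torch.linspace(0.05, 1.0, n_grid) * x_max
    errors = torch.stack([quant_mse(x, c, bits) for c in candidates])
    return candidates[torch.argmin(errors)]
```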

4. Quantization-Aware Fine-Tuning, Factorization, and Output-aware Methods

  • Lossless Adapter-based Fine-Tuning: LoTA-QAF defines a low-rank ternary adapter $\Delta W = A_T B_T$ (with $A_T, B_T \in \{-1, 0, 1\}$), aligned on the quantization grid, together with a merging/offset protocol guaranteeing exact absorption into the N-bit grid. A ternary signed-SGD performs adapter optimization. This approach eliminates rounding error, preserves inference efficiency, and outperforms LoRA and prior adapters in the extreme low-bit regime (2–4 bits) (Chen et al., 24 May 2025).
  • Quantization-aware Factorization: ADMM-based CP factorization directly solves for decomposition factors constrained to the quantizer grid, integrating tensor factorization and quantization into a unified alternating update. Empirically, such joint factorization-quantization dominates any post-hoc quantization in the accuracy-vs-compression tradeoff under aggressive constraints (Cherniuk et al., 2023).
  • Output-approximation and Structural Calibration (LoaQ): LoaQ fits the output of each (linear) sub-layer post-quantization to its original FP32 value by closed-form least-squares. This method exploits the linearity and residual structure of transformer blocks and achieves substantial perplexity and accuracy improvements in LLM PTQ, especially at 3–4 bits (Lin et al., 8 Sep 2025).
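
The output-matching idea in the last item can be shrunk to a tiny closed-form example: fit one least-squares correction scale per output channel so that the quantized layer better reproduces the FP32 layer's outputs on calibration activations. LoaQ's actual calibration exploits the full linear and residual structure of transformer blocks, so treat the helper below, including its name and the per-channel-scale restriction, as an illustrative simplification.

```python
import torch

def output_ls_scales(x_calib, w_fp, w_q):
    """For each output channel j, the closed-form least-squares scalar s_j
    minimizing || x_calib @ w_fp[j] - s_j * (x_calib @ w_q[j]) ||^2.
    Shapes: x_calib (N, in), w_fp and w_q (out, in)."""
    y_fp = x_calib @ w_fp.T                      # reference FP32 outputs, (N, out)
    y_q = x_calib @ w_q.T                        # outputs under quantized weights
    num = (y_fp * y_q).sum(dim=0)
    den = (y_q * y_q).sum(dim=0).clamp_min(1e-12)
    return num / den                             # one correction scale per channel

# The scales fold into the per-channel dequantization factors, so inference cost
# is unchanged: w_q_corrected = scales[:, None] * w_q.
```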

5. Hardware-Aware and Architecture Co-Design

Efficient hardware realization of quantization-aware approximations requires joint optimization over bit-width, quantization scheme, and accelerator architecture:

  • QUIDAM: Provides a co-exploration framework with parameterized hardware models (e.g., FP, INT16, barrel-shift LightPE PEs), polynomial surrogate models for performance/area/energy, and random/MCMC sampling for Pareto front discovery (a toy dominance-filter sketch follows this list). Retraining under the same quantization as the hardware yields up to 5.7× performance/area and 4.7× energy improvements with <1% top-1 drop at 4–8 bits (Inci et al., 2022).
  • Quantization-Aware NAS (FrostNet): Integrates QAT techniques (StatAssist momentum warm-start, GradBoost stochastic amplification) into both cell-level and end-to-end NAS. The Frost bottleneck, built from Conv–BN–ReLU only, is amenable to complete INT8 layer fusion. FrostNet architectures outperform all competing mobile models at matched FLOPs and actual INT8 latency, with robust QAT-from-scratch convergence (Kim et al., 2020).
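
Once candidate configurations have been sampled and their metrics estimated, the Pareto-front discovery mentioned in the QUIDAM item reduces to a dominance filter. The sketch below applies such a filter to made-up (latency, energy, accuracy-drop) triples; the numbers and objective choice are purely illustrative and are not results from the cited work.

```python
import numpy as np

def pareto_front(points):
    """Return the rows of `points` (objectives to be minimized) that are not
    dominated by any other row, i.e. the Pareto-optimal configurations."""
    points = np.asarray(points, dtype=float)
    keep = np.ones(len(points), dtype=bool)
    for i, p in enumerate(points):
        # p is dominated if some other point is no worse everywhere and better somewhere.
        dominated = np.all(points <= p, axis=1) & np.any(points < p, axis=1)
        dominated[i] = False
        if dominated.any():
            keep[i] = False
    return points[keep]

# Toy example: rows are (latency_ms, energy_mJ, top-1 drop %) for sampled
# (bit-width, PE-array) configurations; the third row is dominated and filtered out.
configs = [(2.1, 5.0, 0.8), (1.7, 6.2, 0.5), (2.2, 6.5, 0.9), (1.9, 5.5, 0.4)]
print(pareto_front(configs))
```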

6. Noise Tempering and Regularization

Additive noise, when tailored to quantization error and tempered (e.g., via exponential decay as the error increases), acts as a curvature-dependent regularizer:

  • Error-Aware Noise Tempering: EQNT injects decaying, quant-error-dependent Gaussian noise into the pseudo-quantizer, where the noise variance implicitly encodes the Hessian curvature and penalizes sharp minima. This method improves top-1 accuracy in aggressive quantization over LSQ and reduces sharpness of minima, generalizing to uniform and mixed-precision schemes (Wang et al., 2022).
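
A rough sketch of this tempering idea follows, assuming Gaussian noise whose standard deviation decays exponentially with the step-normalized quantization error; the constants `sigma0` and `alpha` and the exact decay form are illustrative choices rather than the EQNT formulation.

```python
import torch

def tempered_fake_quant(w, step, bits=4, sigma0=0.1, alpha=5.0, training=True):
    """Fake quantizer that adds Gaussian noise whose standard deviation decays
    exponentially with the (step-normalized) quantization error, sketching
    error-aware noise tempering. At evaluation time the noise is dropped."""
    qmax = 2 ** (bits - 1) - 1
    w_q = torch.clamp(torch.round(w / step), -qmax - 1, qmax) * step
    if not training:
        return w_q
    err = (w - w_q).abs() / step                       # normalized quantization error
    sigma = sigma0 * step * torch.exp(-alpha * err)    # tempered noise scale
    return w_q + sigma * torch.randn_like(w)
```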

7. Blockwise and Mixed-Precision Strategies

Quantization-aware approximation extends to hybrid and blockwise regimes:

  • Blockwise Replacement on Full-Precision Counterpart: Mixed-precision networks substitute quantized blocks sequentially into the full-precision backbone, yielding parallel mixed-precision branches. This enables each quantized block to both simulate FP32 behavior and receive STE and FP gradients jointly, thus mitigating gradient estimation errors and improving low-precision performance (up to +2.4% top-1 @ 2 bits) (Yu et al., 20 Dec 2024).
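
A minimal sketch of the blockwise-replacement idea, assuming the network is available as aligned lists of full-precision and quantized blocks; how the per-branch outputs are turned into losses, and the helper's name, are illustrative choices rather than the cited method's exact training recipe.

```python
def blockwise_branches(fp_blocks, q_blocks, x):
    """For each position i, run the backbone with only block i replaced by its
    quantized counterpart, yielding one mixed-precision branch per block. Each
    quantized block therefore sees full-precision activations from the preceding
    blocks, while gradients from the full-precision tail flow back into it
    alongside its own quantizer's surrogate gradient."""
    branches = []
    for i in range(len(fp_blocks)):
        h = x
        for j, (fp_block, q_block) in enumerate(zip(fp_blocks, q_blocks)):
            h = q_block(h) if j == i else fp_block(h)
        branches.append(h)
    return branches   # one output per mixed-precision branch; e.g. sum their losses
```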

In sum, quantization-aware approximations have evolved from simple STE-based training to sophisticated optimization perspectives encompassing constrained, regularized, smoothing-based, adaptive, and structural strategies. These advances have enabled near-lossless, and in some cases better-than-full-precision, performance at 2–4 bits in vision and LLM architectures, robust hardware co-design, and mathematically principled approaches that connect the non-differentiable landscape of quantization with the requirements of stochastic optimization and large-scale neural representation (Jin et al., 19 Mar 2025, Sakr et al., 2022, Chen et al., 24 May 2025, Hounie et al., 2022, Inci et al., 2022, Lin et al., 8 Sep 2025, Cherniuk et al., 2023, Tabesh et al., 21 Oct 2025, Yu et al., 20 Dec 2024, Kim et al., 2020, Li et al., 6 Oct 2024, Wang et al., 2022, You et al., 25 Apr 2024, Zhou et al., 24 Apr 2025, Spallanzani et al., 2019).
