Element-Wise Quantization

Updated 6 February 2026
  • Element-wise quantization is a method that quantizes each weight or activation individually, using learned or adaptive quantization parameters.
  • It employs differentiable approximations or sensitivity-guided thresholds to align quantization with network loss, achieving state-of-the-art efficiency–accuracy trade-offs.
  • Applications span vision and NLP, where techniques like Quantization Networks, FlexRound, and EPTQ demonstrate near-lossless performance even at low bit-widths.

Element-wise quantization refers to quantization methods in which each weight or activation in a neural network is quantized individually, with quantization parameters that may be learned or adapted at the per-element (or at least per-layer/per-channel) level. This paradigm stands in contrast to classical scalar and vector quantization schemes in which the quantization function is fixed globally or per-tensor. Recent research demonstrates that finely tuned, learnable, or sensitivity-aware element-wise quantizers can provide state-of-the-art efficiency–accuracy trade-offs in both quantization-aware training (QAT) and post-training quantization (PTQ), especially in low-bit regimes for vision models and LLMs alike (Yang et al., 2019, Gordon et al., 2023, Lee et al., 2023).

1. Mathematical Formulations of Element-Wise Quantization

Quantization Networks (Yang et al., 2019) define a differentiable element-wise quantization function for any scalar input $x \in \mathbb{R}$ (weight or activation) with a set of $n+1$ discrete quantization levels $Y = \{ y_1, \ldots, y_{n+1} \}$. The inference-time quantizer is implemented as a sum of hard step functions:

$$y = \sum_{i=1}^{n} s_i \cdot A(\beta x - b_i) - o$$

where $A(u) = 1\{u \geq 0\}$, $s_i = y_{i+1} - y_i$, the $b_i$ are learned thresholds, $\beta$ is a learned input scale, and $o = \frac{1}{2} \sum_i s_i$ ensures the quantizer is zero-centered. During training, each step function $A(\cdot)$ is replaced by a scaled sigmoid $\sigma$, yielding a smooth, everywhere-differentiable approximation:

$$Q(x) = \alpha\left(\sum_{i=1}^{n} s_i\, \sigma[T(\beta x - b_i)] - o\right)$$

where $\alpha$ rescales the output and $T$ controls the temperature (steepness).
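The soft/hard quantizer pair can be sketched in a few lines of NumPy. This is a minimal illustration with hand-picked levels and thresholds, not learned parameters:

```python
import numpy as np

def soft_quantizer(x, levels, beta, thresholds, alpha, T):
    """Training-time quantizer of Yang et al. (2019): a sum of scaled
    sigmoids that smoothly approximates a staircase of n step functions."""
    s = np.diff(np.asarray(levels, dtype=float))  # step heights s_i = y_{i+1} - y_i
    o = 0.5 * s.sum()                             # offset keeps the output zero-centered
    sig = 1.0 / (1.0 + np.exp(-T * (beta * x[..., None] - thresholds)))
    return alpha * ((s * sig).sum(axis=-1) - o)

def hard_quantizer(x, levels, beta, thresholds, alpha):
    """Inference-time quantizer: the same staircase with hard steps."""
    s = np.diff(np.asarray(levels, dtype=float))
    o = 0.5 * s.sum()
    steps = (beta * x[..., None] >= thresholds).astype(float)
    return alpha * ((s * steps).sum(axis=-1) - o)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
levels = [-1.0, 0.0, 1.0]               # 3 levels -> 2 step functions
thresholds = np.array([-0.5, 0.5])      # the b_i (fixed here for the demo)
y_soft = soft_quantizer(x, levels, beta=1.0, thresholds=thresholds, alpha=1.0, T=100.0)
y_hard = hard_quantizer(x, levels, beta=1.0, thresholds=thresholds, alpha=1.0)
print(y_hard)
```

Away from the thresholds, the high-temperature soft quantizer is numerically indistinguishable from the hard staircase, which is what makes the soft-to-hard transition well behaved.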

FlexRound (Lee et al., 2023) introduces an alternative weight quantization operator defined by element-wise division:

$$\hat{W} = s_1 \left\lfloor W \oslash \left(s_1 \odot S' \right)\right\rceil$$

where $s_1$ is a learnable common grid size and $S'$, of the same shape as $W$, holds per-element learnable scaling factors. For convolutions, $S'$ can be factorized into per-output/channel/weight components for efficiency. This division-based operator increases the granularity of adaptation across weights, tailored to local magnitude and distribution.
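The operator itself is one line of NumPy. A sketch with illustrative values (note `np.rint` rounds halves to even, a minor deviation from the paper's round-to-nearest $\lfloor \cdot \rceil$):

```python
import numpy as np

def flexround_quantize(W, s1, S_prime):
    """FlexRound-style quantizer (sketch): divide element-wise by the
    learnable grid s1 * S', round, then rescale by s1."""
    return s1 * np.rint(W / (s1 * S_prime))

W = np.array([[0.31, -1.24],
              [0.06,  0.88]])
s1 = 0.1                        # shared learnable grid size
S_prime = np.ones_like(W)       # per-element scales, initialized at 1.0
W_hat = flexround_quantize(W, s1, S_prime)
print(W_hat)
```

With $S'$ at its initialization of all ones, this reduces to ordinary uniform rounding; learning deviates individual entries of $S'$ from 1 to bend the grid around each weight.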

EPTQ (Gordon et al., 2023) frames element-wise quantization sensitivity via the Hessian diagonal: for each element $w^{(\ell)}_i$ in layer $\ell$, the quantization error's impact on the task loss is tightly upper-bounded using a curvature-scaled squared error:

$$L^{(\ell)} \lesssim c \sum_i h^{(\ell)}_i \left(w^{(\ell)}_i - q_i\right)^2$$

where $h^{(\ell)}_i$ is the per-element Hessian (Gauss–Newton) diagonal, $q_i$ the quantized value, and $c$ a global bound. This motivates a quantizer design and parameter search that pays more fidelity to weights with higher loss sensitivity.

2. Learning and Optimization Strategies

In Quantization Networks, all quantizer parameters (α,β,b)(\alpha, \beta, b) for weights and activations in each network module are optimized jointly by back-propagation; the discrete levels YY are fixed, but thresholds bib_i and scales are directly learned. Gradient propagation is exact, as the quantizer is smooth during training. A temperature annealing schedule progressively sharpens the quantizer, bridging performance from soft to hard quantization.
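The effect of temperature annealing can be seen on a single soft step function. The geometric schedule below is illustrative, not the paper's exact one:

```python
import numpy as np

def soft_step(x, T):
    """Scaled sigmoid replacing the hard step 1{x >= 0} during training."""
    return 1.0 / (1.0 + np.exp(-T * x))

x = np.linspace(-1.0, 1.0, 201)
hard = (x >= 0).astype(float)

# As T grows across training, the soft step converges to the hard one,
# so the gap between training-time and inference-time behavior shrinks.
temperatures = [1.0, 10.0, 100.0]
gaps = [np.mean(np.abs(soft_step(x, T) - hard)) for T in temperatures]
for T, g in zip(temperatures, gaps):
    print(f"T={T:6.1f}  mean|soft - hard| = {g:.4f}")
```

Starting soft keeps gradients informative everywhere; sharpening late aligns the trained network with the discrete quantizer actually deployed.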

FlexRound employs the straight-through estimator (STE) for the non-differentiable rounding operation but leverages the division-based structure for parameter updates. The gradient for each scale $S'_{i,j}$ is:

$$\frac{\partial L}{\partial S'_{i,j}} = -\frac{W_{i,j}}{(S'_{i,j})^2} \frac{\partial L}{\partial \hat{W}_{i,j}}$$

indicating that large-magnitude weights yield larger, more flexible updates. Both $s_1$ and all scales can be learned using standard optimizers such as Adam, with the weights kept fixed.
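This gradient rule can be verified numerically. Under STE the rounding is treated as identity, so the effective surrogate is $\hat{W} \approx W \oslash S'$; a finite-difference check on that surrogate (with made-up values) recovers the closed form:

```python
import numpy as np

W  = np.array([0.8, -2.5, 0.1])   # weights (fixed)
S  = np.array([1.1,  0.9, 1.3])   # per-element scales S'
gL = np.array([0.5, -1.0, 2.0])   # upstream gradient dL/dW_hat

# Closed-form rule from the text: dL/dS' = -(W / S'^2) * dL/dW_hat
analytic = -(W / S**2) * gL

# Central finite differences on the STE surrogate W_hat = W / S',
# with L linear in W_hat (coefficients gL):
eps = 1e-6
numeric = np.empty_like(S)
for i in range(len(S)):
    Sp, Sm = S.copy(), S.copy()
    Sp[i] += eps
    Sm[i] -= eps
    numeric[i] = (np.dot(gL, W / Sp) - np.dot(gL, W / Sm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero
```

The $-W/(S')^2$ factor is what makes the update magnitude track the weight magnitude: the large $-2.5$ weight here receives by far the largest scale gradient.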

EPTQ, targeting PTQ scenarios with limited calibration data, uses Hessian-guided selection of quantization thresholds and bit-widths through layerwise minimization of the Hessian-weighted MSE:

$$\mathrm{Hmse}(w^{(\ell)}, t) = \sum_i h^{(\ell)}_i \left(w^{(\ell)}_i - Q_t(w^{(\ell)}_i)\right)^2$$

and further performs a network-wise rounding optimization via knowledge distillation, weighting per-layer and per-sample errors by sensitivity scores derived from sample–layer Hessian attention.
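A toy version of the Hmse threshold search, assuming a symmetric uniform quantizer for $Q_t$ and a synthetic Hessian diagonal (both stand-ins, not EPTQ's implementation):

```python
import numpy as np

def hmse(w, h, t, n_bits=4):
    """Hessian-weighted MSE of symmetric uniform quantization at threshold t."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = t / q_max
    q = scale * np.clip(np.round(w / scale), -q_max - 1, q_max)
    return np.sum(h * (w - q) ** 2)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=1000)       # layer weights
h = rng.uniform(0.1, 10.0, size=1000)      # per-element Hessian diagonal (stand-in)

# Grid search over candidate clipping thresholds, minimizing Hmse:
candidates = np.linspace(0.2, 3.0, 50) * np.abs(w).max()
best_t = min(candidates, key=lambda t: hmse(w, h, t))
print(best_t)
```

The Hessian weights shift the chosen threshold relative to a plain-MSE search: clipping errors on high-curvature elements are penalized more, so the selected grid spends its resolution where the loss is most sensitive.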

3. Gradient Handling and Differentiability

Quantization Networks distinguish themselves by avoiding STEs entirely; the soft quantizer is everywhere differentiable, with closed-form gradients:

$$\frac{\partial \ell}{\partial x} = \frac{\partial \ell}{\partial y} \sum_{i=1}^n \frac{T \beta}{\alpha} s_i g_i (\alpha s_i - g_i)$$

where $g_i = \sigma[T(\beta x - b_i)]$. This enables exact gradient transmission and preserves learning stability for all quantized parameters.

FlexRound, by contrast, uses STE for the rounding operator but achieves robust learning because the division form allows the update of scale factors to directly exploit the magnitude of the underlying weights. The reciprocal rule in gradient computation ensures that larger weights receive proportionally larger gradient updates to their associated scales, making the quantization grid highly adaptive.

EPTQ's differentiability analysis is focused on Hessian estimation and is not intended for end-to-end QAT, but instead for closed-form and iterative optimization in a PTQ context.

4. Implementation and Practical Deployment

Element-wise quantization is implemented in practice by directly inserting the quantization function after each weight (or activation) in all relevant layers (Yang et al., 2019). This design is equivalent in computational cost to element-wise activation functions (e.g., ReLU, sigmoid), with only minor per-module overhead for parameter multiplications.

In FlexRound, PTQ is performed block-by-block (as done in BRECQ and related block reconstruction methods): for each block, full-precision outputs on a small calibration set are collected, then scale factors and grid size are optimized to minimize block output error. Quantization of activations (typically per-tensor) can be conducted in tandem with weight quantization.
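A minimal sketch of block reconstruction, treating a single linear layer as the "block" and grid-searching one shared grid size in place of the Adam-based optimization of $s_1$ and $S'$ (everything here is a simplified stand-in):

```python
import numpy as np

def quantize(W, s):
    """Per-tensor uniform weight quantizer on a 4-bit signed grid (simplified)."""
    return s * np.clip(np.round(W / s), -8, 7)

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))    # block weights (one linear layer)
X = rng.normal(size=(64, 32))    # small calibration batch
Y_fp = X @ W.T                   # full-precision block outputs = targets

def block_err(s):
    """Reconstruction error of the block's output, the PTQ objective."""
    return np.mean((X @ quantize(W, s).T - Y_fp) ** 2)

# Optimize the quantization parameter against block *outputs*, not raw
# weight error -- the defining feature of block-by-block reconstruction:
grid = np.linspace(0.01, 0.5, 100)
s_best = min(grid, key=block_err)
print(block_err(s_best))
```

Optimizing against calibrated block outputs rather than raw weight distortion is what lets these methods work with only a handful of unlabeled samples.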

EPTQ deployment involves four main steps: batch-norm folding, layerwise Hessian diagonal estimation (via Hutchinson stochastic trace estimation), threshold search for optimal Hmse minimization, and global rounding optimization via distillation loss with sample–layer attention. This process allows efficient deployment with small calibration sets, even on models sensitive to quantization noise, such as large vision or language architectures.
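The Hutchinson estimator in step two needs only Hessian-vector products. A sketch on a known toy matrix, where an explicit $H$ stands in for the autodiff Hessian-vector product used in practice:

```python
import numpy as np

def hutchinson_diag(matvec, dim, n_samples=5000, rng=None):
    """Estimate diag(H) from Hessian-vector products only, using
    E[z * (H z)] = diag(H) for i.i.d. Rademacher vectors z."""
    rng = rng or np.random.default_rng(0)
    est = np.zeros(dim)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        est += z * matvec(z)
    return est / n_samples

rng = np.random.default_rng(42)
A = rng.normal(size=(20, 20))
H = A @ A.T                      # symmetric PSD toy "Hessian"
d_est = hutchinson_diag(lambda v: H @ v, 20, rng=rng)
print(np.max(np.abs(d_est - np.diag(H)) / np.diag(H)))
```

Because only products $Hv$ are needed, the diagonal can be estimated for networks where forming $H$ explicitly would be intractable.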

5. Empirical Results and Benchmark Comparisons

Quantization Networks demonstrate near-lossless accuracy down to 3-bit weights with 32-bit activations, e.g., on ImageNet with ResNet-18, achieving 70.4% top-1 (full-precision baseline: 70.3%) and lossless quantization for ResNet-50 at 3 bits. Similar results hold for detection tasks (e.g., Pascal VOC SSD 3-bit: 77.7% mAP vs. baseline 77.8%). The approach remains highly competitive even at 1–2 bits (Yang et al., 2019).

FlexRound is evaluated across vision and NLP benchmarks, showing higher accuracy than AdaRound and BRECQ. For instance, on ImageNet with ResNet-18 at 4 bits (weights only), FlexRound achieves 70.28% top-1 versus 70.18% for AdaRound (Lee et al., 2023). On LLMs such as LLaMA-33B, FlexRound (8/8-bit) yields 69.08% on BoolQ, surpassing AdaRound's 64.86%. Performance remains robust with small calibration sets.

EPTQ consistently outperforms previous block- or layer-wise PTQ methods across vision models. For ResNet-18 at 4/4 bits (weights/activations), EPTQ reaches 69.91% compared to BRECQ’s 69.60%; at 3/3 bits, EPTQ achieves 67.84% vs. QDrop’s 65.65%. Ablation confirms that Hessian-weighted thresholding and sample–layer attention each provide substantial gains, especially as bit-width is reduced (Gordon et al., 2023).

6. Theoretical Justification and Properties

Element-wise quantization improves over uniform or layerwise schemes primarily by:

  • Learning quantizer thresholds and scales jointly with the network, allowing adaptation to the data distribution’s local structure (e.g., via k-means or data-driven initialization) (Yang et al., 2019).
  • Utilizing Hessian-based sensitivity to prioritize accurate representation of weights with the greatest impact on task loss (Gordon et al., 2023).
  • Implementing adaptive granularity: per-element (FlexRound), per-channel, or mixed schemes deliver tighter, non-uniform quantization intervals in high-sensitivity regions of the parameter space.
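The payoff of finer granularity is easy to demonstrate with a toy NumPy comparison (illustrative, not drawn from any of the cited papers): when channel magnitudes are heterogeneous, a single per-tensor grid wastes resolution, while per-channel grids adapt.

```python
import numpy as np

def quant_err(W, scale):
    """Squared round-to-nearest error on a uniform grid; scale may be a
    scalar (per-tensor), per-row array (per-channel), or full array."""
    return np.sum((W - scale * np.round(W / scale)) ** 2)

rng = np.random.default_rng(3)
# Rows with very different magnitudes mimic heterogeneous channels:
W = rng.normal(size=(4, 256)) * np.array([[0.1], [0.5], [1.0], [4.0]])

bits = 4
per_tensor  = np.abs(W).max() / (2 ** (bits - 1))
per_channel = np.abs(W).max(axis=1, keepdims=True) / (2 ** (bits - 1))

print(quant_err(W, per_tensor), quant_err(W, per_channel))
```

Per-element scales, as in FlexRound, push this logic to its limit, letting the grid bend around individual weights rather than whole channels.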

These designs mitigate gradient mismatch either through smooth parameterization (Quantization Networks) or through sensitivity-driven coordinate-wise optimization (EPTQ), reducing reliance on heuristic STE corrections.

7. Comparative Summary

| Method | Quantizer Formulation | Gradient Handling | PTQ/QAT | Typical Parameters | Empirical Outcome |
|---|---|---|---|---|---|
| Quantization Networks | Smooth sum-of-steps, learned thresholds | Exact, closed-form, temperature annealing | QAT | $(\alpha, \beta, \mathbf{b})$ per module | Near-lossless to 3 bits; avoids gradient mismatch |
| EPTQ | Uniform quantization, Hessian-guided | Sensitivity analysis, not end-to-end | PTQ | Per-element threshold, Hmse weighting | SOTA accuracy for 3–4 bit PTQ; best with small calibration sets |
| FlexRound | Element-wise division-based learnable rounding | STE for rounding, adaptive grid via division | PTQ | $s_1$ (global), $S'$, $S_2$, $S_3$ (scales) | Outperforms AdaRound and BRECQ in vision, NLP, LLMs |

Element-wise quantization, through the combination of fine-grained adaptivity, differentiable or sensitivity-driven optimization, and compatibility with both QAT and PTQ, now defines the frontier of efficient quantized model deployment (Yang et al., 2019, Gordon et al., 2023, Lee et al., 2023).
