Incremental Network Quantization (INQ)
- INQ is a set of techniques that incrementally quantize neural network weights and activations, preserving accuracy while reducing memory and computation costs.
- It leverages per‐vector scaled quantization (VS-Quant), which assigns dedicated scale factors to fixed-length weight groups to optimize dynamic range and minimize quantization error.
- INQ’s methods improve hardware efficiency and reduce energy consumption, with reported gains for models such as ResNet-50 and BERT through reduced-bitwidth operations.
Incremental Network Quantization (INQ) refers to a class of techniques that enable the deployment of neural network models with quantized (low-precision) weights and activations, balancing memory footprint reduction, computational efficiency, and minimal loss in inference accuracy. Among prominent INQ methods is vector-level or per-vector scaled quantization, often termed VS-Quant, which assigns dedicated quantization scale factors to small vectors (fixed-length groups of elements) rather than applying a single scale across entire tensors or channels (Dai et al., 2021). This strategy has seen adoption in both neural network inference and signal compression, with substantial impact on hardware efficiency and rate-distortion performance.
1. Fundamental Principles of Per-Vector Scaled Quantization
VS-Quant operates by partitioning tensors such as neural network weights or activation maps into non-overlapping vectors of fixed length $V$ (typically $16$–$64$) and computing a dedicated scale factor for each vector. Given a tensor with elements $x(k, j, i)$ (where $k$ indexes coarse groups such as output channels, $j$ the vector, and $i$ the element within the vector), the per-vector scale is:

$$s(k, j) = \frac{\max_i |x(k, j, i)|}{2^{N-1} - 1}$$

with $N$ as the target bitwidth. Each element is quantized independently within its vector using this scale, $x_q(k, j, i) = \mathrm{round}\big(x(k, j, i) / s(k, j)\big)$, improving the dynamic range allocation and reducing quantization error compared to per-tensor or per-channel scaling. Dequantization is performed by multiplying the quantized values by $s(k, j)$ (Dai et al., 2021).
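The per-vector scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation from Dai et al.: the function names (`quantize_vector`, `quantize_per_vector`, `dequantize`), the symmetric signed integer range, and the max-based scale are choices made here for clarity, and the input row is made-up data.

```python
from typing import List, Tuple


def quantize_vector(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Quantize one vector with its own scale; returns (integers, scale)."""
    qmax = 2 ** (bits - 1) - 1                       # symmetric signed range, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0   # all-zero vector: any scale works
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantize_per_vector(row: List[float], vector_len: int, bits: int):
    """Split a flat row into fixed-length vectors and quantize each independently."""
    return [quantize_vector(row[i:i + vector_len], bits)
            for i in range(0, len(row), vector_len)]


def dequantize(vectors) -> List[float]:
    """Multiply every integer by the scale of the vector it belongs to."""
    return [q * scale for quantized, scale in vectors for q in quantized]


# Hypothetical row: a small-magnitude vector followed by a large-magnitude one.
row = [0.875, -0.25, 0.1, 0.4, 28.0, -12.0, 5.0, 17.0]
vectors = quantize_per_vector(row, vector_len=4, bits=4)
print(vectors)
print(dequantize(vectors))
```

With a vector length of 4, the first vector receives a scale of 0.125 and the second a scale of 4.0, so the small values keep their resolution instead of sharing the step size dictated by the largest element in the row.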
This per-vector scaling concept aligns with per-vector gain-shape decomposition in perceptual vector quantization for video coding, where a vector $\mathbf{x}$ is decomposed into a gain $g = \lVert \mathbf{x} \rVert_2$ and a normalized shape $\mathbf{u} = \mathbf{x} / g$, with each handled by a separate quantization and coding process (Valin et al., 2016).
2. Two-Level Quantization for Scale Factors
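To minimize hardware cost and memory overhead, VS-Quant employs a two-level quantization for scale factors. Each floating-point per-vector scale $s(k, j)$ (for vector $j$ of coarse group $k$) is further quantized into a low-bit integer $s_q(k, j)$ and an accompanying per-group floating-point scale $\gamma(k)$. The group scale $\gamma(k)$ is typically computed as

$$\gamma(k) = \frac{s_{\max}(k)}{2^{M} - 1}$$

where $s_{\max}(k) = \max_j s(k, j)$ is the maximum per-vector scale for group $k$ over all vectors $j$, and $M$ is the number of scale bits. The vector scale quantizer then becomes:

$$s_q(k, j) = \mathrm{round}\!\left(\frac{s(k, j)}{\gamma(k)}\right)$$

Clamping ensures $0 \le s_q(k, j) \le 2^{M} - 1$. With this scheme, the reconstructed float for element $i$, whose quantized integer value is $x_q(k, j, i)$, is:

$$\hat{x}(k, j, i) = x_q(k, j, i)\, s_q(k, j)\, \gamma(k)$$

This multilevel quantization approach maintains high fidelity while significantly reducing the hardware area and energy used to store and apply scaling, with only a minimal increase in metadata storage. An $M$-bit integer scale adds $M$ bits per vector, or $M/V$ bits per element for vectors of length $V$, plus one floating-point scale per coarse group (Dai et al., 2021).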
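The second quantization level can be illustrated with a short sketch. The helper names (`quantize_scales`, `reconstruct_element`, `bits_per_element`) are hypothetical, the unsigned integer range for scales follows the clamping described above, and the example values (4-bit scales, 16-element vectors, the list of scales) are illustrative assumptions rather than figures taken from the paper.

```python
from typing import List, Tuple


def quantize_scales(scales: List[float], scale_bits: int) -> Tuple[List[int], float]:
    """Map float per-vector scales of one group to unsigned integers plus one float gamma."""
    smax_int = 2 ** scale_bits - 1                   # largest integer scale, 15 for 4 bits
    largest = max(scales)
    gamma = largest / smax_int if largest > 0 else 1.0
    int_scales = [max(0, min(smax_int, round(s / gamma))) for s in scales]
    return int_scales, gamma


def reconstruct_element(x_q: int, int_scale: int, gamma: float) -> float:
    """Dequantize one element: integer value times integer scale times group scale."""
    return x_q * int_scale * gamma


def bits_per_element(value_bits: int, scale_bits: int, vector_len: int) -> float:
    """Storage cost per element, ignoring the single float gamma per group."""
    return value_bits + scale_bits / vector_len


# Hypothetical per-vector scales belonging to one coarse group.
int_scales, gamma = quantize_scales([0.6, 3.75, 1.1, 2.0], scale_bits=4)
print(int_scales, gamma)
print(reconstruct_element(7, int_scales[0], gamma))
print(bits_per_element(value_bits=4, scale_bits=4, vector_len=16))
```

One consequence visible in this sketch: a vector whose float scale is less than half of `gamma` rounds to an integer scale of zero, which zeroes the whole vector on reconstruction. The scale bitwidth therefore limits how much range variation within a group can be represented.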
3. Error Reduction Relative to Coarser Granularity Scaling
Traditional quantization methods apply a single scale per-tensor or per-channel. These methods are efficient in terms of parameter overhead but suffer under reduced precision when tensor ranges vary widely. VS-Quant shrinks the dynamic range each scale must cover, markedly reducing the quantization error without requiring retraining:
- For 4-bit weight and activation quantization in ResNet-50 on ImageNet, per-channel scaling yields 70.76% Top-1 accuracy; VS-Quant increases this to 75.28%, a 4.5-point improvement (Dai et al., 2021).
- In BERT-large on SQuAD v1.1 with 4-bit weights and 8-bit activations, per-channel scaling attains 83.18% F1, while per-vector scaling achieves 90.64% F1.
A plausible implication is that per-vector scaling enables deployment of integer-only inference at bitwidths that per-channel scaling cannot attain without significant accuracy loss, extending applicability to more extreme quantization regimes (Dai et al., 2021).
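The effect of scale granularity on quantization error can be reproduced on a toy example. The sketch below compares one shared scale against per-vector scales on a synthetic row; it does not reproduce the ResNet-50 or BERT results above, and the data, vector length, and helper names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_dequantize(values: List[float], bits: int) -> List[float]:
    """Round-trip a group of values through symmetric quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    return [round(v / scale) * scale for v in values]


def mse(reference: List[float], approx: List[float]) -> float:
    """Mean squared error between the original and reconstructed values."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


def coarse_vs_per_vector(row: List[float], vector_len: int, bits: int) -> Tuple[float, float]:
    """Return (error with one scale for the whole row, error with one scale per vector)."""
    coarse = quantize_dequantize(row, bits)
    fine: List[float] = []
    for i in range(0, len(row), vector_len):
        fine.extend(quantize_dequantize(row[i:i + vector_len], bits))
    return mse(row, coarse), mse(row, fine)


# Synthetic row: a small-magnitude vector next to a large-magnitude one.
row = [0.875, -0.25, 0.1, 0.4, 28.0, -12.0, 5.0, 17.0]
coarse_err, per_vector_err = coarse_vs_per_vector(row, vector_len=4, bits=4)
print(coarse_err, per_vector_err)
```

With the shared scale, every element of the small-magnitude vector rounds to zero because the step size is set by the largest element in the row. With per-vector scales, only the rounding error within each vector's own range remains.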
4. Hardware Implementation and Efficiency
VS-Quant's two-level scaling scheme maps naturally to vectorized digital hardware. In a typical implementation, packed vectors of quantized weights or activations are fetched together with their respective low-bit integer scales. Vectorized multiply-accumulate operations proceed on integer values, and the accumulated result is rescaled by the product of the integer scales for both weights and activations:
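- Compute the integer dot product for each vector $j$: $d_j = \sum_i w_q(j, i)\, a_q(j, i)$, where $w_q$ and $a_q$ are the quantized weights and activations
- Compute the combined integer scale: $s_j = s_w(j)\, s_a(j)$, the product of the integer weight and activation scales of vector $j$
- Final sum: $y = \gamma_w \gamma_a \sum_j s_j\, d_j$, where $\gamma_w$ and $\gamma_a$ are the per-group floating-point scales of the weights and activations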
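The three steps above can be sketched in software as follows. This is an illustrative model of the arithmetic only, not the accelerator datapath from the paper: the function name `vs_quant_dot`, the argument layout, and the example integers are assumptions.

```python
from typing import List


def vs_quant_dot(w_q: List[List[int]], w_s: List[int],
                 a_q: List[List[int]], a_s: List[int],
                 gamma_w: float, gamma_a: float) -> float:
    """Dot product of two per-vector quantized operands.

    w_q / a_q hold one list of integers per vector; w_s / a_s hold the
    integer scale of each vector; gamma_w / gamma_a are the float group scales.
    """
    acc = 0
    for wv, av, sw, sa in zip(w_q, a_q, w_s, a_s):
        partial = sum(w * a for w, a in zip(wv, av))   # integer dot product of one vector pair
        acc += partial * (sw * sa)                     # rescale by the combined integer scale
    return acc * gamma_w * gamma_a                     # one floating-point rescale at the end


# Hypothetical operands: two vectors of four elements each.
w_q = [[7, -2, 1, 3], [7, -3, 1, 4]]
w_s = [2, 15]
a_q = [[1, 2, -3, 4], [5, -6, 7, 0]]
a_s = [3, 8]
print(vs_quant_dot(w_q, w_s, a_q, a_s, gamma_w=0.25, gamma_a=0.5))
```

All work inside the loop stays in integer arithmetic; the floating-point group scales are applied once per output. The result matches the dot product of the fully dequantized operands, since the scale factors are constant within each vector and can be pulled out of the inner sum.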
Area and energy measurements for a 16 nm FinFET implementation of ResNet-50 show that 4/4 (4-bit weight, 4-bit activation) per-vector quantization with 4-bit scales requires less area and energy than the baseline. Optimized rounding combined with a larger scale width provides an additional way to reduce energy relative to the baseline. For BERT-base and BERT-large, area is lowered with negligible loss in accuracy (Dai et al., 2021).
5. Quantization-Aware Retraining and Pareto Analysis
Although the bulk of VS-Quant's benefits are realized in post-training quantization (no retraining), additional gains are possible through quantization-aware retraining (QAT). Incorporating a vector-scale quantizer and training for $10$–$20$ epochs, accuracy approaches full-precision performance even at 3–4 bit precision:
- ResNet-50 with 3/3 bits: 75.5% (VS-Quant with QAT) vs 72.0% (per-channel QAT)
- BERT-large 3/8 bits: 90.6% (VS-Quant with QAT) vs 88.3% (per-channel QAT) (Dai et al., 2021)
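The text does not spell out how the vector-scale quantizer is incorporated into training. A common approach, assumed here, is "fake quantization" in the forward pass with a straight-through estimator (STE) in the backward pass. The sketch below applies one such update to a single weight vector of a toy linear model; the function names, squared-error loss, learning rate, and data are illustrative assumptions.

```python
from typing import List, Tuple


def fake_quant(weights: List[float], bits: int) -> List[float]:
    """Quantize then dequantize one weight vector with its own max-based scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / qmax
    return [round(w / scale) * scale for w in weights]


def ste_step(weights: List[float], inputs: List[float], target: float,
             bits: int, lr: float) -> Tuple[List[float], float]:
    """One gradient step on a squared-error loss computed with quantized weights."""
    w_hat = fake_quant(weights, bits)                  # forward pass sees quantized weights
    error = sum(w * x for w, x in zip(w_hat, inputs)) - target
    loss = error * error
    # Straight-through estimator (assumption): treat d(w_hat)/d(w) as 1, so the
    # gradient with respect to w_hat is applied directly to the float weights.
    grads = [2.0 * error * x for x in inputs]
    updated = [w - lr * g for w, g in zip(weights, grads)]
    return updated, loss


# Hypothetical single-vector example.
weights = [1.75, -0.6, 0.3, 1.0]
new_weights, loss = ste_step(weights, inputs=[1.0, 2.0, 0.0, -1.0],
                             target=0.75, bits=4, lr=0.125)
print(fake_quant(weights, 4), loss, new_weights)
```

The float weights are the ones being updated, while the loss is always measured on their quantized counterparts, so training can adapt the weights to the grid imposed by each vector's scale.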
Exploration of the Pareto-optimal design space reveals that VS-Quant configurations populate the low-area, low-energy, and high-accuracy regions more densely than per-channel methods.
6. Analogies to Perceptual Vector Quantization in Signal Coding
VS-Quant shares formal similarities with vector-level quantization frameworks in video coding, notably gain-shape decomposition as in Daala's perceptual vector quantization (PVQ). There, the role of the per-vector scale is played by the gain, the quantized energy of a coefficient block, while the normalized shape is quantized separately (Valin et al., 2016). This decomposition supports advanced perceptual optimization, such as explicit contrast masking, and enables efficient entropy coding by leveraging knowledge of pulse-count constraints and distributional assumptions. PVQ with gain-shape quantization yields average PSNR gains of 0.90 dB (stills) and 0.83 dB (video), corresponding to 24.8% and 13.7% bitrate reductions, respectively.
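Gain-shape decomposition with a pulse-count constraint can be sketched as follows. This is a simplified illustration and not Daala's encoder: the greedy pulse search, the uniform gain quantizer, and all function names are assumptions made here, and entropy coding and perceptual masking are omitted.

```python
import math
from typing import List, Tuple


def gain_shape(vector: List[float]) -> Tuple[float, List[float]]:
    """Split a vector into its L2 norm (gain) and unit-norm direction (shape)."""
    gain = math.sqrt(sum(v * v for v in vector))
    if gain == 0.0:
        return 0.0, [0.0] * len(vector)
    return gain, [v / gain for v in vector]


def quantize_shape(shape: List[float], pulses: int) -> List[int]:
    """Greedy search for an integer vector y with sum(|y_i|) == pulses."""
    n = len(shape)
    mags = [abs(u) for u in shape]
    l1 = sum(mags)
    if l1 == 0.0:
        return [0] * n                                 # zero shape: nothing to encode
    y = [int(pulses * m / l1) for m in mags]           # floor allocation of the pulse budget

    def correlation(candidate: List[int]) -> float:
        norm = math.sqrt(sum(c * c for c in candidate))
        return sum(m * c for m, c in zip(mags, candidate)) / norm

    while sum(y) < pulses:                             # place remaining pulses one at a time
        best = max(range(n), key=lambda i: correlation(y[:i] + [y[i] + 1] + y[i + 1:]))
        y[best] += 1
    return [-k if u < 0 else k for k, u in zip(y, shape)]


def reconstruct(gain_q: float, y: List[int]) -> List[float]:
    """Rebuild the vector from a quantized gain and an integer pulse vector."""
    norm = math.sqrt(sum(k * k for k in y))
    return [gain_q * k / norm for k in y] if norm else [0.0] * len(y)


gain, shape = gain_shape([3.0, -4.0, 0.0, 0.0])
gain_q = round(gain / 0.5) * 0.5                       # uniform gain quantizer, step 0.5 (assumption)
y = quantize_shape(shape, pulses=7)
print(gain_q, y, reconstruct(gain_q, y))
```

As with a per-vector scale in VS-Quant, the gain carries the magnitude of the block, so the shape codebook only has to represent directions.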
A plausible implication is that vector-level scaling provides unified analytical foundations for both neural network quantization and advanced lossy compression, demonstrating cross-domain utility where vector dynamic range heterogeneity must be managed efficiently (Dai et al., 2021, Valin et al., 2016).
7. Practical and Theoretical Impact
VS-Quant's demonstrated ability to recover near-full-precision accuracy at ultra-low bitwidths ($4$–$6$ bits), with marginal metadata and hardware overhead, makes it suitable for compute- and energy-constrained inference deployments. Its adoption facilitates efficient DNN execution on vector processors, accelerators, and embedded systems. The associated rate-distortion improvements in signal compression point to further research directions, including adaptive scale factor allocation, integration with learned quantization approaches, and unified frameworks for joint neural and perceptual optimization (Dai et al., 2021, Valin et al., 2016).