Per-vector and Vector-level Scaling (VS-Quant)
- VS-Quant is a quantization approach that computes scale factors at the granularity of small sub-vectors rather than whole tensors or channels, reducing quantization error and improving representational fidelity.
- The related perceptual vector quantization (PVQ) framework combines gain–shape decomposition, activity masking, and tailored entropy coding to balance computational efficiency, memory cost, and rate–distortion tradeoffs.
- In neural network applications, VS-Quant enhances model accuracy and energy efficiency, as demonstrated by improvements in ResNet50 and BERT performance compared to per-channel methods.
Per-vector and vector-level scaling ("VS-Quant") denote related quantization approaches in which, instead of a single scale per tensor or channel, scale factors are computed and applied at the granularity of small sub-vectors. This methodology is central both to perceptual vector quantization (PVQ) for perceptual coding and to post-training quantization of deep neural networks. The key principle is to reduce quantization error and improve representational fidelity by assigning scale (dynamic range) locally to vector groups, with several techniques developed to balance computational efficiency, memory cost, and rate–distortion or accuracy tradeoffs (Valin et al., 2016, Dai et al., 2021).
1. Gain–Shape and Per-Vector Quantization Frameworks
In the context of AC coefficient quantization for video coding, the gain–shape vector quantization (VQ) model treats each vector as decomposed into:
- Gain ($g$): A scalar, $g = \lVert x \rVert$, encoding the overall energy or contrast of the vector (the per-vector scale).
- Shape ($u$): A unit-norm vector, $u = x / \lVert x \rVert$, encoding the directional pattern of the vector.
Quantization operates by separately encoding the gain $g$ (using a scalar quantizer, possibly with companding for masking) and the shape $u$ (by mapping it to a codeword on the unit sphere from a structured codebook such as Fischer's pyramid VQ). The same paradigm translates to neural network quantization, where tensors are subdivided along a given dimension into vectors and each vector is assigned a local scale factor (Dai et al., 2021).
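The decomposition itself is two lines of linear algebra; a minimal NumPy sketch (the function name is illustrative, not from either paper):

```python
import numpy as np

def gain_shape_decompose(x):
    """Split a vector into a scalar gain and a unit-norm shape."""
    g = np.linalg.norm(x)        # gain: overall energy/contrast
    u = x / g if g > 0 else x    # shape: direction on the unit sphere
    return g, u

x = np.array([3.0, 4.0, 0.0])
g, u = gain_shape_decompose(x)
# g == 5.0, u == [0.6, 0.8, 0.0]; g * u reconstructs x exactly
```

Gain and shape are then quantized independently, which is what makes local (per-vector) control of dynamic range possible.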
2. Mathematical Formulation in Coding and Neural Networks
In PVQ for video coding (Valin et al., 2016), the mathematical formulation consists of:
- Gain Quantization: Either uniform scalar quantization, $\hat g = Q \cdot \operatorname{round}(g/Q)$, or, for activity masking, companded quantization $\hat g = Q_g\,\hat\gamma^{\beta}$ with exponent $\beta > 1$, so that the effective step size grows as a power law of the local contrast.
- Shape Quantization: Using a codebook $S(N, K) = \{\, y \in \mathbb{Z}^N : \sum_i |y_i| = K \,\}$; each codeword is mapped to the unit sphere as $u_y = y / \lVert y \rVert$, and encoding selects the codeword minimizing $\lVert u - u_y \rVert$.
- Allocation: The pulse count $K$ sets the codebook resolution, balancing rate against distortion.
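The shape search can be sketched greedily in NumPy: place $K$ pulses one at a time, each where it most increases correlation with the target shape. This one-pulse-at-a-time search is illustrative only; production encoders use a cheaper projection followed by refinement.

```python
import numpy as np

def pvq_quantize_shape(u, K):
    """Greedily place K pulses to approximate unit vector u with a
    codeword y from S(N, K) = {y integer : sum(|y_i|) = K}. Sketch only."""
    x = np.abs(u)
    sign = np.sign(u)
    y = np.zeros_like(x)
    xy, yy = 0.0, 0.0
    for _ in range(K):
        # correlation achieved by adding one pulse at each position
        corr = (xy + x) / np.sqrt(yy + 2.0 * y + 1.0)
        i = int(np.argmax(corr))
        y[i] += 1.0
        xy += x[i]
        yy += 2.0 * y[i] - 1.0  # running sum of y_i^2
    y = (sign * y).astype(int)
    u_hat = y / np.linalg.norm(y)  # decoded unit-norm shape
    return y, u_hat

u = np.array([0.6, 0.8, 0.0])
y, u_hat = pvq_quantize_shape(u, K=5)   # y = [2, 3, 0]: five pulses total
```

Raising $K$ adds pulses and shrinks the angular error between $u$ and $\hat u$, at the cost of a larger codebook and higher rate.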
In the VS-Quant approach for neural networks (Dai et al., 2021):
- Tensor decomposition into vectors: e.g., subdividing the channel dimension $C$ of a weight or activation tensor into $\lceil C/V \rceil$ vectors of length $V$ (typically $V = 16$–$64$).
- Per-vector scale: For vector $x_v$ and $b$-bit signed integers, set $s_v = \max_i |x_{v,i}| \,/\, (2^{b-1} - 1)$, with quantized elements $\hat x_{v,i} = \operatorname{round}(x_{v,i} / s_v)$.
- Vector-MAC: The dot product is computed over the integer quantized vectors and rescaled by the product of the two per-vector scales, $y = s_w s_a \sum_i \hat w_i \hat a_i$.
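The per-vector scale and vector-MAC steps above can be sketched as follows (illustrative NumPy with max-based calibration; it assumes each vector has at least one nonzero element):

```python
import numpy as np

def per_vector_quantize(x, bits=4):
    """One scale per small vector, calibrated by max absolute value."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit signed
    s = np.max(np.abs(x)) / qmax                   # per-vector scale
    q = np.clip(np.round(x / s), -qmax - 1, qmax).astype(int)
    return q, s

# One vector-MAC: an integer dot product, rescaled once by both scales.
V = 16
rng = np.random.default_rng(0)
w, a = rng.standard_normal(V), rng.standard_normal(V)
qw, sw = per_vector_quantize(w)
qa, sa = per_vector_quantize(a)
approx = sw * sa * np.dot(qw, qa)   # rescaled integer MAC
exact = float(np.dot(w, a))
```

Because the rescale happens once per vector rather than once per element, the inner loop remains a pure integer multiply-accumulate, which is what the hardware exploits.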
3. Two-Level Quantization and Scaling Implementations
To mitigate per-vector scale storage overhead in neural network accelerators, VS-Quant introduces a two-level scale quantization (Dai et al., 2021):
- Compute floating-point per-vector scales $s_v$ for each vector.
- A global (per-channel) shared scale $\gamma_c$ is determined from the maximal $s_v$ within channel $c$: $\gamma_c = \max_v s_v \,/\, (2^{b_s} - 1)$.
- Per-vector scales are quantized to $b_s$-bit unsigned integers: $\hat s_v = \operatorname{round}(s_v / \gamma_c)$.
- Dequantization reconstructs approximate scales, $\tilde s_v = \hat s_v \, \gamma_c$, which are applied to the integer quantized data.
Calibration of these scales uses max-absolute-value, percentile, or entropy methods; for small vector sizes (16–64 elements), the maximum is typically used because so few samples per vector make percentile and entropy estimates statistically unstable.
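A minimal sketch of the two-level scheme, assuming unsigned $b_s$-bit scale integers and max-based calibration (variable names are illustrative):

```python
import numpy as np

def two_level_scales(per_vector_scales, scale_bits=4):
    """Quantize floating-point per-vector scales against one shared
    per-channel scale (sketch of the VS-Quant two-level scheme)."""
    smax = 2 ** scale_bits - 1                    # 15 for 4-bit unsigned
    gamma = np.max(per_vector_scales) / smax      # per-channel fp scale
    s_int = np.round(per_vector_scales / gamma).astype(int)  # b_s-bit ints
    s_approx = s_int * gamma                      # dequantized scales
    return gamma, s_int, s_approx

scales = np.array([0.10, 0.24, 0.06, 0.30])       # per-vector fp scales
gamma, s_int, s_approx = two_level_scales(scales)
# s_int = [5, 12, 3, 15]; only one float (gamma) is stored per channel
```

Only the small integers travel with the data; the single floating-point $\gamma_c$ is amortized over the whole channel, which is what keeps the scale storage overhead low.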
In the context of PVQ, explicit gain quantization enables activity-masked (contrast-masking) companding, in which the quantizer step size is widened as a power-law function of the local gain, matching distortion to the perceptual tolerance at that contrast (Valin et al., 2016).
4. Rate–Distortion and Accuracy Implications
The benefits of per-vector and vector-level scaling are quantitatively substantial across both domains:
- PVQ Video Coding: Replacing scalar quantization with PVQ plus masking yields, on still images (50 images, 1MP), +0.90 dB PSNR, a 24.8% bitrate reduction at fixed PSNR; on video sequences (CIF–720p), +0.83 dB, a 13.7% bitrate reduction. These gains accrue from non-redundant separation of gain and shape, robust prediction methods, perceptual masking, and efficient entropy coding (Valin et al., 2016).
- Neural Network Quantization: Per-vector scaling (VS-Quant) achieves marked accuracy improvements over per-channel quantization with the same or fewer bits. For instance, with 4-bit weights and activations on ResNet50, VS-Quant yields 75.28% top-1 ImageNet accuracy compared to 70.76% for per-channel, and similarly large F1 improvements are observed for BERT models on SQuAD. Hardware implementation produces 26–37% area and 24–43% energy savings relative to an 8-bit per-channel baseline (Dai et al., 2021).
| Scheme | Model & Dataset | Acc./F1 (VS-Quant) | Acc./F1 (Per-Channel) | Area/Energy Reduction |
|---|---|---|---|---|
| 4bW+4bA | ResNet50 ImageNet | 75.28% | 70.76% | 37%/24% |
| 4bW+8bA | BERT-base SQuAD | 86.24% | 73.61% | 26% area |
| 4bW+8bA | BERT-large SQuAD | 90.64% | 83.18% | 26% area |
A plausible implication is that matching quantization granularity to the native vector-MAC size in hardware affords both statistical and architectural efficiency not possible at coarser tensor or channel scales (Dai et al., 2021).
5. Integration with Prediction and Masking
In PVQ for video coding, prediction is integrated via geometric transformation (Householder reflection) rather than simple residual computation, allowing for gain (contrast) to be conserved. The process is as follows:
- Align the prediction vector $r$ with a coordinate axis via a Householder reflection.
- Quantize the angle $\theta$ between the input vector and the reflected prediction, and encode the residual shape in the subspace orthogonal to the prediction using the same PVQ codebook principles.
- Local distortion tuning via explicit gain companding directly incorporates perceptual masking, with power-law companding relating the quantizer step to local contrast.
Such prediction techniques maintain the energy conservation property fundamental to PVQ, distinguishing it from scalar residual approaches (Valin et al., 2016).
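The alignment step can be sketched with the standard Householder construction (a toy NumPy illustration, not code from the codec):

```python
import numpy as np

def householder_reflector(r, axis=0):
    """Return a function applying the Householder reflection that maps
    prediction r onto coordinate axis `axis` (up to sign). Reflections
    are orthogonal, so norms and angles are preserved."""
    s = np.sign(r[axis]) if r[axis] != 0 else 1.0
    v = r.astype(float).copy()
    v[axis] += s * np.linalg.norm(r)   # v = r + sign * ||r|| * e_axis
    vv = np.dot(v, v)
    return lambda x: x - 2.0 * v * (np.dot(v, x) / vv)

r = np.array([1.0, 2.0, 2.0])          # prediction vector, ||r|| = 3
reflect = householder_reflector(r)
rr = reflect(r)                        # lands on the first axis: [-3, 0, 0]
```

Because the same reflection is applied to the input vector, the angle between input and prediction is unchanged, and the gain (norm) of the input is conserved, which is the property scalar residual coding loses.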
6. Entropy Coding and Practical Implementation
PVQ codewords are non-uniformly distributed, necessitating tailored entropy models:
- Magnitude Modeling: For large remaining pulse counts, encode symbol magnitudes with a Laplacian model of the expected magnitude, whose parameter adapts to previously coded statistics.
- Run-Length Modeling: For one or few pulses left, encode zero-run lengths using either Laplacian or geometric distributions.
This statistical model leverages knowledge of the fixed pulse count per vector, directly reflecting the underlying combinatorial structure of the codebook, unlike scalar quantization where coefficients are modeled independently. A consequence is a substantial reduction in redundancy and direct exploitation of codebook structure for bitrate savings (Valin et al., 2016).
In hardware, VS-Quant requires modest modifications to multiplier lanes: scale products and aggregation buffers are implemented per vector, and the additional storage overhead is minimized via the two-level quantization scheme. For $b_s = 4$ scale bits, vector length $V = 16$, and 4-bit data, the per-vector scale overhead is approximately $4 / (16 \times 4) = 6.25\%$ relative to the input data (Dai et al., 2021).
7. Comparative Analysis and Limitations
Comparisons with per-tensor and per-channel scaling show that coarser quantization granularity forces all entries in a section of the tensor to share a scale, leading to inflated quantization noise for most elements. Per-vector scaling, by contrast, matches the scale range more closely to local data statistics, minimizing error for small-vector quantization. However, the approach entails moderate increases in logic complexity for scale application and, without a multilevel quantization scheme, scale storage can become significant. Calibration for scale computation also becomes less robust as vector size decreases due to limited sample statistics (Dai et al., 2021).
A plausible implication is that, despite the increased granularity and complexity, per-vector scaling provides an avenue to unlock both new hardware efficiency regimes and model compression ratios not accessible to traditional per-channel or tensor approaches. Integration of contrast masking, efficient entropy coding, and robust prediction are essential enabling technologies for realizing these gains in real-world codecs and accelerators (Valin et al., 2016, Dai et al., 2021).