Zero-Shot Dynamic Quantization for Transformers
- Zero-shot dynamic quantization is a post-training technique that converts Transformer weights and activations to 8-bit integers without requiring calibration data or retraining.
- It employs a dynamic TM-IQR clipping rule that computes per-token activation maxima and clips outliers via an interquartile-range threshold, preserving model accuracy.
- The approach reduces memory footprint up to 4× and maintains inference throughput with less than 2% slowdown on commodity hardware.
Zero-shot dynamic quantization is a post-training quantization technique for Transformer inference that reduces weights and activations to 8-bit integers with negligible impact on accuracy, without requiring calibration data or model retraining. All quantization parameters are determined without any offline calibration: activation ranges are estimated dynamically at run time via a statistical outlier clipping rule, enabling efficient low-precision operation of BERT-like and related models on commodity hardware (El-Kurdi et al., 2022).
1. Run-Time Quantization Algorithm
Zero-shot dynamic quantization separates weight and activation quantization. Weight matrices are quantized once at model load. Activation tensors are quantized on the fly using dynamically computed clipping thresholds, with no dependence on held-out calibration sets.
1.1 Symmetric Uniform Quantization
For a real-valued matrix $X \in \mathbb{R}^{m \times n}$ (weights or activations), the quantizer uses $b = 8$ bits, symmetric around zero. Fundamental steps:
- Quantization span (QS): $\mathrm{QS}(X) = \max_{i,j} |X_{ij}|$
- Scale factor: $\alpha = \mathrm{QS}(X) / (2^{b-1} - 1) = \mathrm{QS}(X) / 127$
- Quantization to signed 8-bit: $X_q = \mathrm{clip}\!\left(\mathrm{round}(X / \alpha),\, -127,\, 127\right)$
- Dequantization to FP32: $\hat{X} = \alpha \, X_q$
These formulas ensure that zero remains exactly representable (the zero-point is fixed at 0).
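A minimal NumPy sketch of this symmetric quantizer, assuming the $[-127, 127]$ integer range implied by the scale formula; the function names `quantize_symmetric` and `dequantize` are illustrative, not from the paper:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric uniform quantization with the zero-point fixed at 0.

    Returns the signed integer tensor and the FP32 scale alpha needed
    to dequantize it.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    qs = float(np.abs(x).max())           # quantization span QS = max |x|
    scale = qs / qmax if qs > 0 else 1.0  # scale factor alpha
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to FP32: x_hat = alpha * x_q."""
    return scale * x_q.astype(np.float32)

# Zero stays exactly representable because the zero-point is 0.
x = np.array([[-0.8, 0.0, 1.2], [0.3, -2.4, 0.05]], dtype=np.float32)
x_q, alpha = quantize_symmetric(x)
x_hat = dequantize(x_q, alpha)            # close to x, exact at 0.0
```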
1.2 Outlier-Robust Clipping: TM-IQR
Direct use of $\max_{i,j} |X_{ij}|$ for activation range estimation causes degradation due to outlier sensitivity. The Token-Maximums Interquartile Range (TM-IQR) approach adapts Tukey's outlier rule for fast, per-inference outlier rejection:
- For each row (token) $i$ of $X$, compute $m_i = \max_j |X_{ij}|$.
- Let $Q_1, Q_3$ be the first and third quartiles of $\{m_i\}$.
- Outlier threshold: $T = Q_3 + 1.5\,(Q_3 - Q_1)$.
- Clip every $X_{ij}$ to $[-T, T]$.
- Recompute $\mathrm{QS}$ on the clipped $X$, then proceed as above.
The TM-IQR procedure runs in $O(mn)$ time per $m \times n$ activation matrix, adding less than 2% throughput overhead in practice.
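A sketch of the TM-IQR clipping step under the same assumptions (NumPy, activations laid out as tokens × features); `tm_iqr_clip` is an illustrative name, and it would be combined with the `quantize_symmetric` sketch above:

```python
import numpy as np

def tm_iqr_clip(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Token-Maximums IQR clipping (sketch).

    x: activation matrix of shape (tokens, features).
    Applies Tukey's rule to the per-token absolute maxima so that the
    subsequent max-based range estimate is not dominated by outliers.
    """
    token_max = np.abs(x).max(axis=1)            # m_i = max_j |x_ij|
    q1, q3 = np.percentile(token_max, [25, 75])  # first and third quartiles
    threshold = q3 + k * (q3 - q1)               # T = Q3 + 1.5 * IQR
    return np.clip(x, -threshold, threshold)     # clip every entry to [-T, T]

# Usage with the quantizer sketched in Section 1.1:
#   x_q, alpha = quantize_symmetric(tm_iqr_clip(activations))
```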
2. Mathematical Formulation
2.1. Dynamic Range and Scale

$$\mathrm{QS}(X) = \max_{i,j} |X_{ij}|, \qquad \alpha_X = \frac{\mathrm{QS}(X)}{2^{b-1} - 1} = \frac{\mathrm{QS}(X)}{127}$$

2.2. Quantization and Dequantization

$$X_q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\tfrac{X}{\alpha_X}\right),\, -127,\, 127\right), \qquad \hat{X} = \alpha_X\, X_q$$

2.3. TM-IQR Outlier Threshold

$$T = Q_3 + 1.5\,(Q_3 - Q_1), \quad \text{where } Q_1, Q_3 \text{ are the quartiles of the token maxima } m_i = \max_j |X_{ij}|$$

2.4. Integer Matrix Multiplication (GEMM) and Rescaling

Given quantized inputs $X_q$, $W_q$ with scales $\alpha_X$, $\alpha_W$:

$$Y \approx \alpha_X\, \alpha_W\, (X_q W_q), \quad \text{with the integer product accumulated in int32.}$$
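A small self-contained numerical sketch of the int8 GEMM with rescaling from Section 2.4, assuming NumPy and the symmetric scheme above; the helper names are illustrative:

```python
import numpy as np

def quant(x: np.ndarray, qmax: int = 127):
    """Symmetric int8 quantization: returns (int8 tensor, scale)."""
    scale = max(float(np.abs(x).max()) / qmax, 1e-12)
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

def quantized_gemm(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Y ~= alpha_X * alpha_W * (X_q @ W_q), accumulated in int32."""
    x_q, alpha_x = quant(x)
    w_q, alpha_w = quant(w)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # int32 accumulation
    return alpha_x * alpha_w * acc                      # rescale back to FP32

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)
err = np.max(np.abs(quantized_gemm(x, w) - x @ w))      # small quantization error
```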
3. Integration with Transformer Layer Execution
A standard Transformer encoder layer involves multi-head attention (two GEMMs), normalization layers, and a feedforward subnetwork (FFN) with two GEMMs flanking a nonlinearity (ReLU/GELU).
- At model-load time: all weight matrices $W$ are quantized to int8 ($W_q$) using the precomputed scale $\alpha_W$.
- At run time, for each quantized GEMM:
  - Take the FP32 input activation $X$.
  - If the GEMM is the second FFN GEMM (the step contributing most quantization error), apply TM-IQR outlier clipping to $X$.
  - Compute the activation scale $\alpha_X$ on the (possibly clipped) $X$.
  - Quantize activations: $X_q = \mathrm{clip}(\mathrm{round}(X / \alpha_X), -127, 127)$.
  - Perform the int8 GEMM ($X_q W_q$), accumulating in int32.
  - Dequantize the output using the combined scale $\alpha_X \alpha_W$.
In principle, TM-IQR could be applied before every GEMM, but empirical results show restricting it to the FFN's second GEMM achieves most of the available improvement with minimal extra cost.
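The sketch below shows how these run-time steps might slot into the FFN of one encoder layer, reusing `quantize_symmetric` and `tm_iqr_clip` from the sketches in Sections 1.1 and 1.2; the layer layout and the ReLU placeholder are illustrative simplifications (BERT-style models typically use GELU):

```python
import numpy as np

# Assumes quantize_symmetric and tm_iqr_clip from the sketches above.
# w1_q/alpha_w1 and w2_q/alpha_w2 are the FFN weights quantized at model-load time.

def ffn_forward(x, w1_q, alpha_w1, w2_q, alpha_w2):
    # First FFN GEMM: dynamic activation quantization, no outlier clipping.
    x_q, alpha_x = quantize_symmetric(x)
    h = alpha_x * alpha_w1 * (x_q.astype(np.int32) @ w1_q.astype(np.int32))
    h = np.maximum(h, 0.0)               # nonlinearity (ReLU here for simplicity)

    # Second FFN GEMM: the most error-prone step, so clip outliers first.
    h = tm_iqr_clip(h)
    h_q, alpha_h = quantize_symmetric(h)
    return alpha_h * alpha_w2 * (h_q.astype(np.int32) @ w2_q.astype(np.int32))
```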
4. Experimental Results
4.1 Hardware and Throughput
- 48-core Intel Xeon 8260 CPU:
- int8 GEMM without TM-IQR: 29,005 words/sec
- int8 GEMM with TM-IQR: 28,640 words/sec (less than 2% slowdown)
- NVIDIA V100 (fp16 reference): 71,998 words/sec
4.2 Benchmarks and Models
- GLUE suite (MNLI, MNLI-MM, CoLA, SST-2, MRPC, STS-B, QQP, QNLI, RTE)
- Models: BERT-base-cased, BERT-large-cased, RoBERTa-base, RoBERTa-large
- Question Answering: TyDI-QA (11 languages), Natural Questions
- Models: XLM-R-base, XLM-R-large
4.3 Results Summary
| Model / Dataset | FP32 | int8 w/o TM-IQR | +TM-IQR | Δ vs FP32 |
|---|---|---|---|---|
| BERT-base | 82.0 | 80.3 | 81.8 | –0.2 |
| BERT-large | 84.4 | 83.0 | 84.0 | –0.4 |
| RoBERTa-base | 82.0 | 75.2 | 80.8 | –1.2 |
| RoBERTa-large | 87.1 | 86.6 | 86.7 | –0.4 |
| TyDI-base (XLM-R) | 67.7 | 62.9 | 67.0 | –0.7 |
| TyDI-large | 68.8 | 66.8 | 68.4 | –0.4 |
| NQ-base | 54.6 | 48.0 | 53.4 | –1.2 |
| NQ-large | 56.6 | 53.3 | 56.1 | –0.5 |
- Memory footprint reduction is roughly 4× (8-bit vs. 32-bit).
- Inference throughput is hardly impacted by TM-IQR dynamic quantization.
5. Limitations and Prospective Directions
- The TM-IQR rule uses a fixed multiplier (1.5) for all layers and activation types, which may not be optimal for all cases; tuning or learning the value may yield further robustness.
- Currently, only the FFN's second GEMM is subject to dynamic outlier clipping; more widespread application (e.g., all GEMMs) could close the accuracy gap further, albeit with higher runtime cost.
- Symmetric quantization with a zero-point fixed at 0 may be suboptimal for predominantly non-negative activations (e.g., ReLU/GELU outputs); asymmetric quantization could be explored.
- Mixed precision and per-channel quantization remain unexplored in this scheme, presenting future avenues for model compression.
- Incorporating a lightweight, unlabeled calibration stage to tune the IQR multiplier, or extending the TM-IQR quantization approach to attention weights and outputs, are possible future enhancements.
In summary, zero-shot dynamic quantization enables automatic 8-bit inference of off-the-shelf Transformer models via a simple yet effective statistical outlier clipping mechanism, achieving a balance between memory reduction and accuracy preservation, all without calibration data or retraining (El-Kurdi et al., 2022).