Zero-Shot Dynamic Quantization for Transformers
- Zero-shot dynamic quantization is a post-training technique that converts Transformer weights and activations to 8-bit integers without requiring calibration data or retraining.
- It employs a dynamic TM-IQR clipping rule that computes per-token activation maxima and clips outliers via an interquartile-range threshold, preserving model accuracy.
- The approach reduces memory footprint up to 4× and maintains inference throughput with less than 2% slowdown on commodity hardware.
Zero-shot dynamic quantization is a post-training quantization technique for Transformer inference that reduces weights and activations to 8-bit integers with negligible impact on accuracy, without requiring calibration data or model retraining. All quantization parameters are determined without any offline calibration: activation ranges are estimated dynamically at run time via a statistical outlier clipping rule, enabling efficient low-precision operation of BERT-like and related models on commodity hardware (El-Kurdi et al., 2022).
1. Run-Time Quantization Algorithm
Zero-shot dynamic quantization separates weight and activation quantization. Weight matrices are quantized once at model load. Activation tensors are quantized on the fly using dynamically computed clipping thresholds, with no dependence on held-out calibration sets.
1.1 Symmetric Uniform Quantization
For a real-valued matrix $X \in \mathbb{R}^{m \times n}$ (weights or activations), the quantizer uses $b = 8$ bits, symmetric around zero. Fundamental steps:
- Quantization span (QS): $\mathrm{QS}(X) = \max_{i,j} |X_{ij}|$
- Scale factor: $\alpha = \mathrm{QS}(X) / (2^{b-1} - 1) = \mathrm{QS}(X) / 127$
- Quantization to signed 8-bit: $X_q = \mathrm{clip}\!\left(\mathrm{round}(X / \alpha),\, -127,\, 127\right)$
- Dequantization to FP32: $\hat{X} = \alpha \, X_q$
These formulas ensure that zero remains exactly representable (the zero-point is fixed at 0).
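A minimal NumPy sketch of this symmetric quantizer, assuming the $[-127, 127]$ integer range implied by the scale formula; the function names `quantize_symmetric` and `dequantize` are illustrative, not from the paper:

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, bits: int = 8):
    """Symmetric uniform quantization with the zero-point fixed at 0.

    Returns the signed integer tensor and the FP32 scale alpha needed
    to dequantize it.
    """
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits
    qs = float(np.abs(x).max())           # quantization span QS = max |x|
    scale = qs / qmax if qs > 0 else 1.0  # scale factor alpha
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def dequantize(x_q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to FP32: x_hat = alpha * x_q."""
    return scale * x_q.astype(np.float32)

# Zero stays exactly representable because the zero-point is 0.
x = np.array([[-0.8, 0.0, 1.2], [0.3, -2.4, 0.05]], dtype=np.float32)
x_q, alpha = quantize_symmetric(x)
x_hat = dequantize(x_q, alpha)            # close to x, exact at 0.0
```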
1.2 Outlier-Robust Clipping: TM-IQR
Direct use of $\max_{i,j} |X_{ij}|$ for activation range estimation causes degradation due to outlier sensitivity. The Token-Maximums Interquartile Range (TM-IQR) approach adapts Tukey's outlier rule for fast, per-inference outlier rejection:
- For each row (token) $i$ of $X$, compute $m_i = \max_j |X_{ij}|$.
- Let $Q_1, Q_3$ be the first and third quartiles of $\{m_i\}$.
- Outlier threshold: $T = Q_3 + 1.5\,(Q_3 - Q_1)$.
- Clip every $X_{ij}$ to $[-T, T]$.
- Recompute $\mathrm{QS}$ on the clipped $X$, then proceed as above.
The TM-IQR procedure runs in $O(mn)$ time per $m \times n$ activation matrix, adding less than 2% throughput overhead in practice.
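A sketch of the TM-IQR clipping step under the same assumptions (NumPy, activations laid out as tokens × features); `tm_iqr_clip` is an illustrative name, and it would be combined with the `quantize_symmetric` sketch above:

```python
import numpy as np

def tm_iqr_clip(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Token-Maximums IQR clipping (sketch).

    x: activation matrix of shape (tokens, features).
    Applies Tukey's rule to the per-token absolute maxima so that the
    subsequent max-based range estimate is not dominated by outliers.
    """
    token_max = np.abs(x).max(axis=1)            # m_i = max_j |x_ij|
    q1, q3 = np.percentile(token_max, [25, 75])  # first and third quartiles
    threshold = q3 + k * (q3 - q1)               # T = Q3 + 1.5 * IQR
    return np.clip(x, -threshold, threshold)     # clip every entry to [-T, T]

# Usage with the quantizer sketched in Section 1.1:
#   x_q, alpha = quantize_symmetric(tm_iqr_clip(activations))
```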
2. Mathematical Formulation
2.1. Dynamic Range and Scale

$$\mathrm{QS}(X) = \max_{i,j} |X_{ij}|, \qquad \alpha_X = \frac{\mathrm{QS}(X)}{2^{b-1} - 1} = \frac{\mathrm{QS}(X)}{127}$$

2.2. Quantization and Dequantization

$$X_q = \mathrm{clip}\!\left(\mathrm{round}\!\left(\tfrac{X}{\alpha_X}\right),\, -127,\, 127\right), \qquad \hat{X} = \alpha_X\, X_q$$

2.3. TM-IQR Outlier Threshold

$$T = Q_3 + 1.5\,(Q_3 - Q_1), \quad \text{where } Q_1, Q_3 \text{ are the quartiles of the token maxima } m_i = \max_j |X_{ij}|$$

2.4. Integer Matrix Multiplication (GEMM) and Rescaling

Given quantized inputs $X_q$, $W_q$ with scales $\alpha_X$, $\alpha_W$:

$$Y \approx \alpha_X\, \alpha_W\, (X_q W_q), \quad \text{with the integer product accumulated in int32.}$$
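A small self-contained numerical sketch of the int8 GEMM with rescaling from Section 2.4, assuming NumPy and the symmetric scheme above; the helper names are illustrative:

```python
import numpy as np

def quant(x: np.ndarray, qmax: int = 127):
    """Symmetric int8 quantization: returns (int8 tensor, scale)."""
    scale = max(float(np.abs(x).max()) / qmax, 1e-12)
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

def quantized_gemm(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Y ~= alpha_X * alpha_W * (X_q @ W_q), accumulated in int32."""
    x_q, alpha_x = quant(x)
    w_q, alpha_w = quant(w)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # int32 accumulation
    return alpha_x * alpha_w * acc                      # rescale back to FP32

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((16, 8)).astype(np.float32)
err = np.max(np.abs(quantized_gemm(x, w) - x @ w))      # small quantization error
```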
3. Integration with Transformer Layer Execution
A standard Transformer encoder layer involves multi-head attention (two GEMMs), normalization layers, and a feedforward subnetwork (FFN) with two GEMMs flanking a nonlinearity (ReLU/GELU).
- At model-load time: all weight matrices $W$ are quantized to int8 ($W_q$) using the precomputed scale $\alpha_W$.
- At run time, for each quantized GEMM:
  - Take the FP32 input activation $X$.
  - If the GEMM is the second FFN GEMM (the step contributing most quantization error), apply TM-IQR outlier clipping to $X$.
  - Compute the activation scale $\alpha_X$ on the (possibly clipped) $X$.
  - Quantize activations: $X_q = \mathrm{clip}(\mathrm{round}(X / \alpha_X), -127, 127)$.
  - Perform the int8 GEMM ($X_q W_q$), accumulating in int32.
  - Dequantize the output using the combined scale $\alpha_X \alpha_W$.
In principle, TM-IQR could be applied before every GEMM, but empirical results show restricting it to the FFN's second GEMM achieves most of the available improvement with minimal extra cost.
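The sketch below shows how these run-time steps might slot into the FFN of one encoder layer, reusing `quantize_symmetric` and `tm_iqr_clip` from the sketches in Sections 1.1 and 1.2; the layer layout and the ReLU placeholder are illustrative simplifications (BERT-style models typically use GELU):

```python
import numpy as np

# Assumes quantize_symmetric and tm_iqr_clip from the sketches above.
# w1_q/alpha_w1 and w2_q/alpha_w2 are the FFN weights quantized at model-load time.

def ffn_forward(x, w1_q, alpha_w1, w2_q, alpha_w2):
    # First FFN GEMM: dynamic activation quantization, no outlier clipping.
    x_q, alpha_x = quantize_symmetric(x)
    h = alpha_x * alpha_w1 * (x_q.astype(np.int32) @ w1_q.astype(np.int32))
    h = np.maximum(h, 0.0)               # nonlinearity (ReLU here for simplicity)

    # Second FFN GEMM: the most error-prone step, so clip outliers first.
    h = tm_iqr_clip(h)
    h_q, alpha_h = quantize_symmetric(h)
    return alpha_h * alpha_w2 * (h_q.astype(np.int32) @ w2_q.astype(np.int32))
```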
4. Experimental Results
4.1 Hardware and Throughput
- 48-core Intel Xeon 8260 CPU:
- int8 GEMM without TM-IQR: 29,005 words/sec
- int8 GEMM with TM-IQR: 28,640 words/sec (less than 2% slowdown)
- NVIDIA V100 (fp16 reference): 71,998 words/sec
4.2 Benchmarks and Models
- GLUE suite (MNLI, MNLI-MM, CoLA, SST-2, MRPC, STS-B, QQP, QNLI, RTE)
- Models: BERT-base-cased, BERT-large-cased, RoBERTa-base, RoBERTa-large
- Question Answering: TyDI-QA (11 languages), Natural Questions
- Models: XLM-R-base, XLM-R-large
4.3 Results Summary
| Model / Dataset | FP32 | int8 w/o TM-IQR | +TM-IQR | Δ vs FP32 |
|---|---|---|---|---|
| BERT-base | 82.0 | 80.3 | 81.8 | –0.2 |
| BERT-large | 84.4 | 83.0 | 84.0 | –0.4 |
| RoBERTa-base | 82.0 | 75.2 | 80.8 | –1.2 |
| RoBERTa-large | 87.1 | 86.6 | 86.7 | –0.4 |
| TyDI-base (XLM-R) | 67.7 | 62.9 | 67.0 | –0.7 |
| TyDI-large | 68.8 | 66.8 | 68.4 | –0.4 |
| NQ-base | 54.6 | 48.0 | 53.4 | –1.2 |
| NQ-large | 56.6 | 53.3 | 56.1 | –0.5 |
- Memory footprint reduction is roughly 4× (8-bit vs. 32-bit).
- Inference throughput is hardly impacted by TM-IQR dynamic quantization.
5. Limitations and Prospective Directions
- The TM-IQR rule uses a fixed multiplier (1.5) for all layers and activation types, which may not be optimal for all cases; tuning or learning the value may yield further robustness.
- Currently, only the FFN's second GEMM is subject to dynamic outlier clipping; more widespread application (e.g., all GEMMs) could close the accuracy gap further, albeit with higher runtime cost.
- Symmetric quantization with a zero-point fixed at 0 may be suboptimal for predominantly non-negative activations (e.g., ReLU/GELU outputs); asymmetric quantization could be explored.
- Mixed precision and per-channel quantization remain unexplored in this scheme, presenting future avenues for model compression.
- Incorporating a lightweight, unlabeled calibration stage to tune the IQR multiplier, or extending the TM-IQR quantization approach to attention weights and outputs, are possible future enhancements.
In summary, zero-shot dynamic quantization enables automatic 8-bit inference of off-the-shelf Transformer models via a simple yet effective statistical outlier clipping mechanism, achieving a balance between memory reduction and accuracy preservation, all without calibration data or retraining (El-Kurdi et al., 2022).