
Zero-Shot Dynamic Quantization for Transformers

Updated 3 February 2026
  • Zero-shot dynamic quantization is a post-training technique that converts Transformer weights and activations to 8-bit integers without requiring calibration data or retraining.
  • It employs a dynamic TM-IQR clipping rule that computes per-token activation ranges to mitigate outlier effects and preserve model accuracy.
  • The approach reduces the memory footprint by roughly 4× and maintains inference throughput on commodity hardware, with less than 2% slowdown attributable to TM-IQR clipping.

Zero-shot dynamic quantization is a post-training quantization technique for Transformer inference that reduces memory and compute precision to 8-bit integers with negligible impact on accuracy, without requiring calibration data or model retraining. This approach performs quantization purely at run time, relying on dynamically estimated activation ranges via a statistical outlier clipping rule, enabling efficient low-precision operation of BERT-like and related models on commodity hardware (El-Kurdi et al., 2022).

1. Run-Time Quantization Algorithm

Zero-shot dynamic quantization separates weight and activation quantization. Weight matrices are quantized once at model load. Activation tensors are quantized on the fly using dynamically computed clipping thresholds, with no dependence on held-out calibration sets.

1.1 Symmetric Uniform Quantization

For a real-valued matrix $\mathcal{M}\in\mathbb{R}^{m\times n}$ (weights or activations), the quantizer uses $b=8$ bits, symmetric around zero. Fundamental steps:

  • Quantization span (QS):

\mathrm{QS} = \max_{i,j} |\mathcal{M}_{i,j}|

  • Scale factor:

s = \frac{2^{b-1}-1}{\mathrm{QS}} = \frac{127}{\mathrm{QS}}

  • Quantization to signed 8-bit:

\bar{\mathcal{M}}_{i,j} = \mathrm{clip}_{[-127,127]}\left( \mathrm{round}(\mathcal{M}_{i,j}\times s) \right)

  • Dequantization to FP32:

\widehat{\mathcal{M}}_{i,j} = \frac{\bar{\mathcal{M}}_{i,j}}{s} = \bar{\mathcal{M}}_{i,j}\,\frac{\mathrm{QS}}{127}

These formulas ensure that zero remains exactly representable (zero-point is fixed at 0).
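
For concreteness, here is a minimal NumPy sketch of this symmetric per-tensor quantizer and its inverse; the function names and the use of NumPy are illustrative choices, not part of the original method description.

```python
import numpy as np

def quantize_symmetric(M: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization: returns the int8 tensor and its scale."""
    qs = np.max(np.abs(M))                    # quantization span QS = max |M_ij|
    qmax = 2 ** (bits - 1) - 1                # 127 for b = 8
    s = qmax / qs if qs > 0 else 1.0          # scale factor s = 127 / QS
    M_bar = np.clip(np.round(M * s), -qmax, qmax).astype(np.int8)
    return M_bar, s

def dequantize_symmetric(M_bar: np.ndarray, s: float) -> np.ndarray:
    """Inverse mapping back to FP32: M_hat = M_bar / s."""
    return M_bar.astype(np.float32) / s

# Round-trip example: the per-entry error is bounded by half a step, ~QS/254.
M = np.random.randn(4, 8).astype(np.float32)
M_bar, s = quantize_symmetric(M)
print(np.max(np.abs(M - dequantize_symmetric(M_bar, s))))
```

Because the mapping is symmetric with zero-point 0, exact zeros in $\mathcal{M}$ survive the round trip unchanged.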

1.2 Outlier-Robust Clipping: TM-IQR

Directly using $\max|A|$ to estimate the activation range degrades accuracy, because a handful of outlier activations inflate the quantization span and waste integer resolution on the bulk of the values. The Token-Maximums Interquartile Range (TM-IQR) approach adapts Tukey's outlier rule for fast, per-inference outlier rejection:

  1. For each row (token) $i = 1,\ldots,L$ in $A\in\mathbb{R}^{L\times H}$, compute $M_i = \max_{j} |A_{i,j}|$.
  2. Let $q_1, q_3$ be the first and third quartiles of $\{M_1,\ldots,M_L\}$.
  3. Compute the outlier threshold:

t = q_3 + 1.5\,(q_3 - q_1)

  4. Clip every $A_{i,j}$ to $[-t, t]$.
  5. Recompute $\mathrm{QS} = \max_{i,j} |A_{i,j}|$ on the clipped $A$, then proceed as in Section 1.1.

The TM-IQR procedure runs in $O(L \log L + L H)$ per activation, adding less than 2% throughput overhead in practice.
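
A minimal sketch of the clipping step is shown below; the quartile estimator (np.percentile with its default linear interpolation) and the function name are assumptions for illustration, since details like the exact quartile method are not pinned down here.

```python
import numpy as np

def tm_iqr_clip(A: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Token-Maximums IQR clipping for activations A of shape (L tokens, H dims):
    clip to [-t, t], where t is Tukey's upper fence over the per-token maxima."""
    M = np.max(np.abs(A), axis=1)            # per-token maxima M_i, shape (L,)
    q1, q3 = np.percentile(M, [25, 75])      # first and third quartiles
    t = q3 + k * (q3 - q1)                   # outlier threshold
    return np.clip(A, -t, t)

# Example: a single outlier no longer dictates the quantization span.
A = np.random.randn(128, 768).astype(np.float32)
A[7, 13] = 40.0                              # simulated activation outlier
print(np.max(np.abs(A)), "->", np.max(np.abs(tm_iqr_clip(A))))
```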

2. Mathematical Formulation

2.1 Dynamic Range and Scale

\mathrm{QS} = \max_{1\leq i\leq m,\,1\leq j\leq n} |\mathcal{M}_{i,j}|, \qquad s = \frac{127}{\mathrm{QS}}

2.2 Quantization and Dequantization

\bar{\mathcal{M}}_{i,j} = \mathrm{clip}_{[-127,127]}\left( \mathrm{round}(s\,\mathcal{M}_{i,j}) \right), \qquad \widehat{\mathcal{M}}_{i,j} = \frac{\bar{\mathcal{M}}_{i,j}}{s}
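
As a worked example with invented numbers (not taken from the paper), take $\mathrm{QS}=0.5$ and an entry $\mathcal{M}_{i,j}=0.1$:

s = \frac{127}{0.5} = 254, \qquad \bar{\mathcal{M}}_{i,j} = \mathrm{clip}_{[-127,127]}\big(\mathrm{round}(254 \times 0.1)\big) = 25, \qquad \widehat{\mathcal{M}}_{i,j} = \frac{25}{254} \approx 0.0984

The round-trip error here is about 0.0016, within the worst-case bound of half a quantization step, $\mathrm{QS}/254 \approx 0.002$.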

2.3 TM-IQR Outlier Threshold

M_i = \max_{1\leq j \leq H} |A_{i,j}|, \qquad q_1 = \text{1st quartile}, \qquad q_3 = \text{3rd quartile}

t = q_3 + 1.5\,(q_3 - q_1), \qquad A_{i,j} \leftarrow \min(\max(A_{i,j}, -t),\, t)
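
For intuition, consider a small invented example (quartiles computed by linear interpolation): suppose the per-token maxima of a four-token batch are $\{0.9,\, 1.0,\, 1.2,\, 8.0\}$. Then

q_1 = 0.975, \qquad q_3 = 2.9, \qquad t = 2.9 + 1.5\,(2.9 - 0.975) \approx 5.79

so the outlier token is clipped from 8.0 to about 5.79, the quantization span shrinks accordingly, and the remaining tokens retain far more of the 127-level resolution.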

2.4 Integer Matrix Multiplication (GEMM) and Rescaling

Given quantized inputs $\bar{A}$, $\bar{W}$ with activation and weight scales $s_A$, $s_W$:

Z^{\text{int32}} = \bar{A} \times_{\text{int}} \bar{W} \;\implies\; Z^{\text{FP32}} = Z^{\text{int32}} \cdot \frac{1}{s_A s_W}
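
The sketch below puts the run-time pieces together for one quantized GEMM. It assumes the weight matrix was already quantized at load time, and it uses an int32 NumPy matmul as a stand-in for a real int8 GEMM kernel; the function name is illustrative.

```python
import numpy as np

def quantized_gemm(A: np.ndarray, W_bar: np.ndarray, s_W: float) -> np.ndarray:
    """Dynamic int8 GEMM: quantize activations on the fly, multiply in integer
    arithmetic with int32 accumulation, then rescale the result to FP32."""
    qs_A = np.max(np.abs(A))
    s_A = 127.0 / qs_A if qs_A > 0 else 1.0
    A_bar = np.clip(np.round(A * s_A), -127, 127).astype(np.int8)
    Z_int32 = A_bar.astype(np.int32) @ W_bar.astype(np.int32)
    return Z_int32.astype(np.float32) / (s_A * s_W)

# Example: weights quantized once "at load time", then compared to the FP32 matmul.
A = np.random.randn(16, 64).astype(np.float32)
W = np.random.randn(64, 32).astype(np.float32)
s_W = 127.0 / np.max(np.abs(W))
W_bar = np.clip(np.round(W * s_W), -127, 127).astype(np.int8)
print(np.max(np.abs(A @ W - quantized_gemm(A, W_bar, s_W))))
```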

3. Integration with Transformer Layer Execution

A standard Transformer encoder layer involves multi-head attention (two GEMMs), normalization layers, and a feedforward subnetwork (FFN) with two GEMMs flanking a nonlinearity (ReLU/GELU).

  • At model-load time: each weight matrix $W$ is quantized once to $\bar{W}$ and its scale $s_W$ is stored.
  • At run time, for each quantized GEMM:
    • Take the input activation $A$.
    • If the GEMM is the second FFN GEMM (the step contributing most of the quantization error), apply TM-IQR outlier clipping.
    • Compute the activation scale $s_A$ on the (possibly clipped) $A$.
    • Quantize the activations, $A \rightarrow \bar{A}$.
    • Perform the int8 GEMM $\bar{A} \times \bar{W}$, accumulating in int32.
    • Dequantize the output using $1/(s_A s_W)$.

In principle, TM-IQR could be applied before every GEMM, but empirical results show restricting it to the FFN's second GEMM achieves most of the available improvement with minimal extra cost.
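
A minimal sketch of this selective placement, reusing the hypothetical tm_iqr_clip and quantized_gemm helpers from the earlier sketches (and a standard tanh-approximate GELU, which is an illustrative choice of nonlinearity), could look like this:

```python
import numpy as np

def gelu(x: np.ndarray) -> np.ndarray:
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def quantized_ffn(A, W1_bar, s_W1, W2_bar, s_W2):
    """FFN sub-layer with both GEMMs in int8. TM-IQR clipping is applied only to
    the input of the second GEMM, where quantization error is largest."""
    H = gelu(quantized_gemm(A, W1_bar, s_W1))   # first GEMM: plain dynamic int8
    H = tm_iqr_clip(H)                          # outlier-robust range estimation
    return quantized_gemm(H, W2_bar, s_W2)      # second GEMM: clipped dynamic int8
```

Applying tm_iqr_clip before every quantized GEMM would be a one-line change per call site, at the cost of the additional per-token sorting described in Section 1.2.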

4. Experimental Results

4.1 Hardware and Throughput

  • 48-core Intel Xeon 8260 CPU:
    • int8 GEMM without TM-IQR: 29,005 words/sec
    • int8 GEMM with TM-IQR: 28,640 words/sec (less than 2% slowdown)
  • NVIDIA V100 (fp16 reference): 71,998 words/sec

4.2 Benchmarks and Models

  • GLUE suite (MNLI, MNLI-MM, CoLA, SST-2, MRPC, STS-B, QQP, QNLI, RTE)
    • Models: BERT-base-cased, BERT-large-cased, RoBERTa-base, RoBERTa-large
  • Question Answering: TyDI-QA (11 languages), Natural Questions
    • Models: XLM-R-base, XLM-R-large

4.3 Results Summary

Model / Dataset      FP32   int8 (no TM-IQR)   int8 + TM-IQR   Δ (+TM-IQR vs FP32)
BERT-base            82.0   80.3               81.8            −0.2
BERT-large           84.4   83.0               84.0            −0.4
RoBERTa-base         82.0   75.2               80.8            −1.2
RoBERTa-large        87.1   86.6               86.7            −0.4
TyDI-base (XLM-R)    67.7   62.9               67.0            −0.7
TyDI-large           68.8   66.8               68.4            −0.4
NQ-base              54.6   48.0               53.4            −1.2
NQ-large             56.6   53.3               56.1            −0.5
  • Memory footprint reduction is roughly 4× (8-bit vs. 32-bit).
  • Inference throughput is hardly impacted by TM-IQR dynamic quantization.

5. Limitations and Prospective Directions

  • The TM-IQR rule uses a fixed multiplier (1.5) for all layers and activation types, which may not be optimal for all cases; tuning or learning the value may yield further robustness.
  • Currently, only the FFN's second GEMM is subject to dynamic outlier clipping; more widespread application (e.g., all GEMMs) could close the accuracy gap further, albeit with higher runtime cost.
  • Symmetric quantization with a zero-point fixed at 0 may be suboptimal for non-negative or predominantly positive activations (as produced by ReLU/GELU); asymmetric quantization could be explored.
  • Mixed precision and per-channel quantization remain unexplored in this scheme, presenting future avenues for model compression.
  • Possible future enhancements include incorporating a lightweight, unlabeled calibration stage to tune the IQR multiplier, and extending the TM-IQR quantization approach to attention weights and outputs.

In summary, zero-shot dynamic quantization enables automatic 8-bit inference of off-the-shelf Transformer models via a simple yet effective statistical outlier clipping mechanism, achieving a balance between memory reduction and accuracy preservation, all without calibration data or retraining (El-Kurdi et al., 2022).

References

El-Kurdi, Y., Quinn, J., and Sil, A. (2022). Zero-shot Dynamic Quantization for Transformer Inference.
