W8A8 Static Quantization in Neural Networks

Updated 5 December 2025
  • W8A8 per-tensor static quantization is a method that maps both weights and activations to 8-bit integers using a single calibration-determined scale per tensor.
  • It enhances hardware efficiency by enabling INT8 GEMM optimizations that reduce memory bandwidth and inference latency, despite sensitivity to extreme outliers.
  • Recent advances such as activation-weight equalization, TWEO loss, and channel-flattening mitigate outlier issues, preserving model accuracy with minimal precision loss.

W8A8 per-tensor static quantization refers to the process of quantizing both the weights (W) and activations (A) of a neural network, typically a transformer or diffusion model, to 8-bit integer representations. The "per-tensor" attribute indicates that a single quantization scale (and, if used, zero-point) is applied to each tensor—weight matrix or activation block—rather than fine-grained schemes such as per-channel or per-token. "Static" calibration denotes that these scales are determined on a calibration set prior to inference and remain fixed during deployment. W8A8 per-tensor static quantization is highly hardware-efficient, unlocking the use of specialized INT8 GEMM kernels and reducing memory bandwidth, but it is sensitive to activation and weight outliers, motivating an array of mitigation strategies in recent research.

1. Quantization Fundamentals and Formulation

In W8A8 per-tensor static quantization, the primary objective is to map real-valued weights and activations into the range $[-127, 127]$, using a shared scale per tensor with the zero-point typically fixed at zero. For a tensor $x$ (either weight or activation) and bit-width $b = 8$:

$$s = \frac{\max(|x|)}{2^{b-1} - 1},\qquad q = \mathrm{round}(x/s),\qquad \hat{x} = s \cdot \mathrm{clip}\left(q,\, -2^{b-1},\, 2^{b-1}-1\right)$$

In asymmetric cases (mainly in some audio/vision models), the scale and zero-point are:

$$s = \frac{\max(x) - \min(x)}{2^{b} - 1},\qquad z = \mathrm{round}\left(-\frac{\min(x)}{s}\right),\qquad q = \mathrm{round}\left(\frac{x}{s}\right) + z$$

Most contemporary LLM and vision transformer pipelines, as well as the vLLM and CUTLASS INT8 kernels, use symmetric quantization with $z = 0$ (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Kurtic et al., 4 Nov 2024, Zhao et al., 4 Jun 2024, Khandelwal et al., 30 Sep 2025, Li et al., 2023).
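A minimal sketch of the symmetric per-tensor scheme above, written in PyTorch; the function names are illustrative and not taken from any cited codebase:

```python
import torch

def pertensor_symmetric_quantize(x: torch.Tensor, bits: int = 8):
    """Quantize a tensor with a single symmetric scale (zero-point = 0)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    scale = x.abs().max().clamp(min=1e-12) / qmax   # one scale per tensor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover the real-valued approximation x_hat = s * q."""
    return q.to(torch.float32) * scale

# Example: quantization error on a well-behaved (outlier-free) tensor
x = torch.randn(4096)
q, s = pertensor_symmetric_quantize(x)
print((x - dequantize(q, s)).abs().max())   # small, on the order of s/2
```

The code clamps to the symmetric range $[-127, 127]$ described in the text; whether the extra code point $-128$ is used is a kernel-level convention and does not affect the scale computation.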

Typically, separate scaling factors are computed for weights and activations, each determined once per tensor by their extrema on a calibration set. This one-scale-per-tensor approach minimizes kernel overhead and maximizes throughput on modern accelerators. The essential workflow for per-tensor static W8A8 quantization is:

| Step | Description |
|---|---|
| Calibration | Scan the calibration set for per-tensor maxima/minima |
| Scale computation | Compute a single scale per tensor as above |
| Quantization | Map weights/activations to 8-bit integers with the scale |
| Inference | Perform INT8 GEMMs, rescaling outputs as needed |
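A hedged sketch of the calibration and freezing steps from the table, assuming a simple observer object (a hypothetical `StaticActQuantizer`, not an API from any cited framework):

```python
import torch

class StaticActQuantizer:
    """Tracks the per-tensor absolute maximum of an activation during
    calibration, then freezes a single INT8 scale for deployment."""
    def __init__(self, bits: int = 8):
        self.qmax = 2 ** (bits - 1) - 1
        self.absmax = 0.0
        self.scale = None

    def observe(self, x: torch.Tensor):
        # Calibration pass: only record statistics, do not quantize.
        self.absmax = max(self.absmax, x.abs().max().item())

    def freeze(self):
        # The scale is fixed after calibration ("static" quantization).
        self.scale = max(self.absmax, 1e-12) / self.qmax

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        q = torch.round(x / self.scale).clamp(-self.qmax, self.qmax)
        return q.to(torch.int8)

# Calibration: run a few representative batches, calling observe() on the
# layer's input activation, then freeze() before deployment.
act_q = StaticActQuantizer()
for batch in [torch.randn(8, 512) for _ in range(16)]:   # stand-in calibration set
    act_q.observe(batch)
act_q.freeze()
print("frozen activation scale:", act_q.scale)
```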

2. Challenges: Outliers and Quantization Error

The primary limitation of per-tensor static quantization is its vulnerability to outliers. In transformers and diffusion models, a small number of activations or weights can be orders of magnitude larger than the median, which inflates the single per-tensor scale. This substantially reduces the representational precision for the remaining elements, causing severe degradation in model quality, as nearly all non-extreme values are mapped to zero or a few discrete levels (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024, Zhang et al., 28 Feb 2024). For example, naïve per-tensor W8A8 quantization of activations in GPT-2 Medium degrades perplexity from 16.8 (BF16) to 1,450, and baseline W8A8 drops vision-transformer accuracy from above 80% to as low as 40% (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024).

Table: Effect of Activation Outliers on Naïve W8A8

| Model | BF16/FP16 Quality | Naïve W8A8 Quality |
|---|---|---|
| GPT-2 XL (PPL) | 13.84 | 1,872 |
| ViT-B (Top-1 acc.) | 80.29% | 40.16% |

This pattern is observed across large LLMs (OPT, LLaMA, GPT-2) and vision/audio diffusion models (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Khandelwal et al., 30 Sep 2025).
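The sensitivity is easy to reproduce in isolation: a single extreme value inflates the shared scale and collapses the effective resolution for everything else. The numbers below are synthetic and purely illustrative, not taken from the cited papers:

```python
import torch

def quant_error(x: torch.Tensor, bits: int = 8):
    """Mean absolute per-tensor quantization error and the scale used."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    x_hat = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    return (x - x_hat).abs().mean().item(), scale.item()

torch.manual_seed(0)
acts = torch.randn(4096)            # typical activations, |x| mostly < 4
err, s = quant_error(acts)
print(f"no outlier:   scale={s:.4f}  mean abs error={err:.5f}")

acts[0] = 500.0                     # one extreme outlier
err, s = quant_error(acts)
print(f"with outlier: scale={s:.4f}  mean abs error={err:.3f}")
# The scale grows by roughly two orders of magnitude, so most values now
# land on only a handful of integer levels and the mean error explodes.
```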

3. Mitigating Outliers: Structural and Algorithmic Advances

Recent research has proposed several methods to address the outlier problem and make W8A8 per-tensor static quantization viable at scale:

  1. Activation-Weight Equalization (AWEQ): Introduces a channelwise rescaling so that each weight and its corresponding activation channel have similar dynamic ranges before quantization. The calibration phase computes per-channel equalization coefficients, which are folded into the respective tensors, followed by standard per-tensor static quantization. AWEQ also adds a bias-correction term to further reduce quantization-induced output bias (Li et al., 2023); a code sketch of the equalization idea appears after this list.
  2. TWEO Loss (Transformers Without Extreme Outliers): Incorporates a loss regularizer that penalizes only the extreme tails of the activation distribution during training, suppressing outliers from >10,000 to <20 (typical threshold τ=3, exponent p=4); a sketch of such a tail penalty appears at the end of this section. This keeps all activations well within the quantizable range, enabling lossless per-tensor W8A8 quantization without architectural changes or mixed precision (Liang et al., 28 Nov 2025).
  3. CushionCache Prefixing (Attention Sinks): Proposes a two-stage process: (i) a greedy prefix-token search that minimizes downstream activation maxima, and (ii) prefix tuning that optimizes both the prediction and quantization losses. The cached key-value vectors from the discovered prefix are injected into each transformer layer at inference, massively reducing activation outliers and thus improving W8A8 quantization fidelity (Son et al., 17 Jun 2024).
  4. Channel Flattening (FlattenQuant): Slices large channels into multiple virtual channels capped at a fixed dynamic-range threshold $T$, dramatically tightening the per-tensor maximum. This allows even lower bit-widths (e.g., 4 bits) with minimal information loss, and it also improves vanilla W8A8 quantization by reducing quantization error (Zhang et al., 28 Feb 2024).
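A minimal sketch of the channelwise equalization idea behind AWEQ and SmoothQuant (item 1 above): per-channel factors shift dynamic range from activations into weights before per-tensor quantization, leaving the layer output mathematically unchanged. The smoothing exponent and all names here are illustrative assumptions, not the exact formulation of either paper:

```python
import torch

def equalize(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Fold per-channel equalization factors into activations and weights.

    act_absmax : [in_features] per-channel activation max from calibration
    weight     : [out_features, in_features] linear-layer weight
    Returns the per-channel divisor for activations and the rescaled weight,
    chosen so that (x / s) @ (W * s)^T == x @ W^T.
    """
    w_absmax = weight.abs().amax(dim=0)                     # per input channel
    s = (act_absmax ** alpha) / (w_absmax ** (1 - alpha))   # balance the ranges
    s = s.clamp(min=1e-5)
    return s, weight * s                                    # fold s into W

# After equalization, both x/s and W*s have flatter per-channel ranges,
# so a single per-tensor INT8 scale wastes far less resolution.
x = torch.randn(16, 512) * torch.logspace(-1, 2, 512)       # skewed channels
W = torch.randn(1024, 512)
s, W_eq = equalize(x.abs().amax(dim=0), W)
assert torch.allclose(x @ W.t(), (x / s) @ W_eq.t(), rtol=1e-3, atol=1e-2)
```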

Other works propose denoising-timestep-aware smoothing and low-rank adapters, especially in audio/vision transformers, to further address temporal/channel outliers (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024).
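For concreteness, here is a hedged sketch of a tail-penalty regularizer in the spirit of the TWEO loss from item 2 above: activations beyond roughly τ standard deviations contribute a term raised to the power p, added to the task loss during training. The exact formulation in (Liang et al., 28 Nov 2025) may differ; this is only a schematic:

```python
import torch

def tail_penalty(act: torch.Tensor, tau: float = 3.0, p: int = 4) -> torch.Tensor:
    """Penalize only the extreme tail of an activation tensor.

    Values within tau standard deviations contribute nothing; the excess
    beyond the threshold is raised to the power p, so the penalty grows
    very fast for genuine outliers and leaves typical values untouched.
    """
    std = act.std().detach()                 # scale-free threshold
    excess = (act.abs() - tau * std).clamp(min=0.0)
    return (excess / std).pow(p).mean()

# Training-loop usage (schematic): total = task_loss + lam * sum of
# tail_penalty(h) over hidden states h collected via forward hooks.
h = torch.randn(256, 1024)
h[0, 0] = 80.0                               # synthetic outlier
print(tail_penalty(h))                       # dominated by the single outlier
```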

4. Empirical Performance and Hardware Impact

Recent large-scale studies confirm that, when outlier mitigation is employed, per-tensor static W8A8 quantization can match or nearly match full-precision model quality (typically <1–3% accuracy drop) with substantial gains:

  • Accuracy: For models such as Llama-3.1-8B/70B/405B, static per-tensor W8A8-INT recovers 97–101% of Open LLM Leaderboard accuracy or within <0.5 perplexity points of BF16 on text generation benchmarks (Kurtic et al., 4 Nov 2024, Liang et al., 28 Nov 2025). Vision transformers quantized with TWEO recover >98% of top-1 ImageNet accuracy (Liang et al., 28 Nov 2025).
  • Speed and Cost: Inference is accelerated by 1.2–4× and memory footprint shrinks by 2–4×, depending on model size and hardware. For example, Llama-3.1-405B latency per token drops from ~27.7s (BF16) to ~8.3s (W8A8) with a 4× cost reduction (Kurtic et al., 4 Nov 2024).
  • Latency and Throughput: INT8 GEMM kernels are fully exploited with per-tensor scales; INT8 matmuls yield ~2× speedup over FP16, and cost per query is materially reduced in scaled deployments (Zhang et al., 28 Feb 2024, Zhao et al., 4 Jun 2024). A sketch of the rescaled INT8 GEMM appears after this list.
  • Memory: Weight and activation memory each drop by a factor of two versus BF16 (a factor of four versus FP32 baselines) when INT8 is adopted for both (Liang et al., 28 Nov 2025, Zhang et al., 28 Feb 2024).
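With per-tensor scales, the output rescale is a single fused multiply, which is what lets INT8 kernels (e.g., in CUTLASS/vLLM) run at full rate. The sketch below only emulates the arithmetic in PyTorch rather than calling a real INT8 kernel; the function name is illustrative:

```python
import torch

def int8_linear(x_q: torch.Tensor, s_x: float,
                w_q: torch.Tensor, s_w: float) -> torch.Tensor:
    """Emulated W8A8 GEMM: INT8 inputs, wide accumulation, one rescale.

    x_q : [batch, in]  int8 activations with per-tensor scale s_x
    w_q : [out, in]    int8 weights     with per-tensor scale s_w
    """
    # Real kernels accumulate in INT32; float64 holds these sums exactly
    # since |q| <= 127, so this emulation is bit-accurate for the math.
    acc = x_q.to(torch.float64) @ w_q.to(torch.float64).t()
    return (acc * (s_x * s_w)).to(torch.float32)   # single per-tensor rescale

# Example with random INT8 data and scales that would come from calibration:
x_q = torch.randint(-127, 128, (4, 512), dtype=torch.int8)
w_q = torch.randint(-127, 128, (1024, 512), dtype=torch.int8)
y = int8_linear(x_q, s_x=0.02, w_q=w_q, s_w=0.004)
print(y.shape)   # torch.Size([4, 1024])
```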

Table: Representative W8A8 Static Quantization Results

| Model | Task/Metric | BF16/FP16 | W8A8 Static | W8A8 + Outlier Mitigation | Speedup |
|---|---|---|---|---|---|
| Llama-3.1-70B | Acc. (%) | 99.9 | 98.8 | 99.9 (TWEO/Cache) | 1.5–4× |
| GPT-2 XL | Perplexity | 13.84 | 1,872 | 13.09 (TWEO) | 3–4× |
| ViT-B (87M) | Top-1 (%) | 80.29 | 40.16 | 80.29 (TWEO) | |
| Audio DiT | CLAP score | 0.3009 | | 0.2934 | 2–3× |
| PixArt-α (image gen) | FID | 73.34 | | 75.61 | 1.47× |

5. Implementation Variants and Best Practices

Several configurations and implementation details are prevalent in recent literature:

Table: Major Outlier Mitigation Techniques Compatible with Static W8A8

| Method | Approach | Primary Paper |
|---|---|---|
| TWEO loss | Training-time outlier suppression | (Liang et al., 28 Nov 2025) |
| CushionCache | Prefix KV cache for activations | (Son et al., 17 Jun 2024) |
| AWEQ | Channel equalization, bias correction | (Li et al., 2023) |
| FlattenQuant | Channel flattening for activations | (Zhang et al., 28 Feb 2024) |
| SmoothQuant | Channel balancing | (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024) |

6. Limitations and Future Directions

While W8A8 per-tensor static quantization is now effective for large-scale LLM and ViT models under outlier mitigation, some limitations remain:

  • Activation Outlier Vulnerability: When outlier removal (via TWEO, CushionCache, or flattening) is not feasible, the approach remains brittle, with catastrophic degradation on complex distributions (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024).
  • Fine-Grained Alternatives: Per-channel or per-group quantization, as used in some vision/audio models, can mildly improve fidelity but at the cost of increased kernel/memory complexity (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024).
  • Dynamic Activation Ranges: Some audio and vision transformers now use dynamic per-token/timestep quantization for activations, which can further close the gap to full precision but relinquishes some of the efficiency advantages of static scales (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024). A sketch contrasting the two schemes follows this list.
  • Scalability: On extremely large sequence lengths or atypical architectures, even advanced mitigation may not capture all quantization-induced pathologies.
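For contrast with the static per-tensor scheme, a sketch of dynamic per-token activation quantization as mentioned in the third bullet: each token row gets its own scale computed at runtime, which tracks outliers better but adds a reduction and a per-row rescale to every forward pass. The function name is illustrative:

```python
import torch

def dynamic_per_token_quantize(x: torch.Tensor, bits: int = 8):
    """One scale per token (row), computed at inference time.

    x : [tokens, hidden] float activations
    Returns int8 values and a [tokens, 1] scale tensor. Unlike the static
    per-tensor scheme, no calibration pass is needed, but every call pays
    for the row-wise max reduction and the kernel must apply per-row scales.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

x = torch.randn(8, 4096)
x[3, 17] = 200.0                  # an outlier only hurts its own token's scale
q, s = dynamic_per_token_quantize(x)
print(s.squeeze())                # row 3 has a much larger scale than the rest
```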

A plausible implication is that further research will combine hybrid techniques (training-time regularization, runtime prefixing, and hardware co-design) in pursuit of lossless low-precision deployment alongside maximal throughput and memory savings.

7. Practical Recommendations and Deployment Experience

For deployment on modern accelerators, the following practical guidelines are supported by the large-scale empirical studies cited above:

  • Calibrate on a small but representative set and freeze per-tensor scales before deployment; symmetric scales with the zero-point fixed at zero match the vLLM and CUTLASS INT8 kernel conventions.
  • Apply an outlier-mitigation step (equalization/smoothing, TWEO-style training regularization, prefix caching, or channel flattening) before per-tensor static quantization; naïve static W8A8 on transformer activations is brittle.
  • Validate quality on the target benchmarks after quantization; with mitigation, accuracy typically stays within 1–3% of the BF16/FP16 baseline while latency improves by roughly 1.2–4× and weight/activation memory halves.

References: (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Khandelwal et al., 30 Sep 2025, Son et al., 17 Jun 2024, Li et al., 2023, Kurtic et al., 4 Nov 2024, Zhao et al., 4 Jun 2024)
