W8A8 Static Quantization in Neural Networks
- W8A8 per-tensor static quantization is a method that maps both weights and activations to 8-bit integers using a single calibration-determined scale per tensor.
- It enhances hardware efficiency by enabling INT8 GEMM optimizations that reduce memory bandwidth and inference latency, despite sensitivity to extreme outliers.
- Recent advances such as activation-weight equalization, TWEO loss, and channel-flattening mitigate outlier issues, preserving model accuracy with minimal precision loss.
W8A8 per-tensor static quantization refers to the process of quantizing both the weights (W) and activations (A) of a neural network, typically a transformer or diffusion model, to 8-bit integer representations. The "per-tensor" attribute indicates that a single quantization scale (and, if used, zero-point) is applied to each tensor—weight matrix or activation block—rather than the finer-grained scales used by per-channel or per-token schemes. "Static" denotes that these scales are determined on a calibration set prior to inference and remain fixed during deployment. W8A8 per-tensor static quantization is highly hardware-efficient, unlocking the use of specialized INT8 GEMM kernels and reducing memory bandwidth, but it is sensitive to activation and weight outliers, motivating an array of mitigation strategies in recent research.
1. Quantization Fundamentals and Formulation
In W8A8 per-tensor static quantization, the primary objective is to map real-valued weights and activations into the signed 8-bit range $[-2^{b-1}, 2^{b-1}-1] = [-128, 127]$, using a shared scale per tensor with the zero-point typically fixed at zero. For a tensor $X$ (either weight or activation) and bit-width $b$, the symmetric scale and quantizer are

$$s = \frac{\max(|X|)}{2^{b-1}-1}, \qquad Q(x) = \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{x}{s}\right),\, -2^{b-1},\, 2^{b-1}-1\right).$$

In asymmetric cases (mainly in some audio/vision models), the scale and zero-point are

$$s = \frac{\max(X) - \min(X)}{2^{b}-1}, \qquad z = \mathrm{round}\!\left(-\frac{\min(X)}{s}\right).$$

Most contemporary LLM and vision transformer pipelines, as well as the vLLM and CUTLASS INT8 kernels, use symmetric quantization with $z = 0$ (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Kurtic et al., 4 Nov 2024, Zhao et al., 4 Jun 2024, Khandelwal et al., 30 Sep 2025, Li et al., 2023).
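The formulas above can be written out directly. The following NumPy sketch is illustrative only; the function names and the use of NumPy (rather than a production INT8 kernel) are our assumptions:

```python
import numpy as np

def symmetric_scale(x: np.ndarray, bits: int = 8) -> float:
    # s = max|x| / (2^(b-1) - 1); zero-point fixed at 0
    return float(np.abs(x).max()) / (2 ** (bits - 1) - 1)

def quantize_symmetric(x: np.ndarray, s: float, bits: int = 8) -> np.ndarray:
    # Q(x) = clip(round(x / s), -2^(b-1), 2^(b-1) - 1)
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(x / s), qmin, qmax).astype(np.int8)

def asymmetric_params(x: np.ndarray, bits: int = 8):
    # s = (max - min) / (2^b - 1), z = round(-min / s)
    s = float(x.max() - x.min()) / (2 ** bits - 1)
    z = int(round(-float(x.min()) / s))
    return s, z

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
s = symmetric_scale(w)
print(quantize_symmetric(w, s))  # int8 codes; dequantize with q * s
```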
Typically, separate scaling factors are computed for weights and activations, each determined once per tensor by their extrema on a calibration set. This one-scale-per-tensor approach minimizes kernel overhead and maximizes throughput on modern accelerators. The essential workflow for per-tensor static W8A8 quantization is:
| Step | Description |
|---|---|
| Calibration | Scan calibration set for per-tensor maxima/minima |
| Scale computation | Compute single scale per tensor as above |
| Quantization | Map weights/activations to 8-bit integers with scale |
| Inference | Perform INT8 GEMMs, rescaling outputs as needed |
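A hedged end-to-end sketch of this workflow follows, assuming a toy linear layer and a synthetic calibration set; real deployments fuse the INT8 GEMM and output rescaling into kernels such as those in vLLM or CUTLASS, so the NumPy integer matmul here only simulates what those kernels compute:

```python
import numpy as np

BITS = 8
QMAX = 2 ** (BITS - 1) - 1  # 127

def absmax_scale(x):
    return float(np.abs(x).max()) / QMAX

def quantize(x, s):
    return np.clip(np.round(x / s), -QMAX - 1, QMAX).astype(np.int8)

rng = np.random.default_rng(0)

# --- Calibration: scan a calibration set for the per-tensor activation maximum ---
calib_batches = [rng.standard_normal((16, 512)).astype(np.float32) for _ in range(8)]
act_absmax = max(float(np.abs(x).max()) for x in calib_batches)
s_act = act_absmax / QMAX                      # fixed (static) activation scale

# --- Offline weight quantization for a hypothetical linear layer ---
W = rng.standard_normal((512, 512)).astype(np.float32)
s_w = absmax_scale(W)
W_q = quantize(W, s_w)

# --- Inference: INT8 GEMM with int32 accumulation, then a single rescale ---
x = rng.standard_normal((4, 512)).astype(np.float32)
x_q = quantize(x, s_act)
acc = x_q.astype(np.int32) @ W_q.astype(np.int32).T   # what an INT8 kernel accumulates
y = acc.astype(np.float32) * (s_act * s_w)            # dequantize the output once

print("max abs error vs FP32:", float(np.abs(y - x @ W.T).max()))
```

Because both scales are fixed ahead of time, no runtime statistics are gathered and the only extra work at inference is the final multiply by `s_act * s_w`.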
2. Challenges: Outliers and Quantization Error
The primary limitation of per-tensor static quantization is its vulnerability to outliers. In transformers and diffusion models, a small number of activations or weights can be orders of magnitude larger than the median, forcing an excessively large quantization scale. This drastically reduces the representational precision available to most elements, degrading model quality, since nearly all non-extreme values are mapped to zero or a handful of discrete levels (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024, Zhang et al., 28 Feb 2024). For example, naïve per-tensor W8A8 quantization of activations in GPT-2 Medium degrades perplexity from 16.8 (BF16) to 1,450, and for vision transformers, baseline W8A8 drops top-1 accuracy from above 80% to as low as 40% (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024).
Table: Effect of Activation Outliers on Naïve W8A8
| Model | BF16/FP16 Quality | Naïve W8A8 Quality |
|---|---|---|
| GPT-2 XL (PPL) | 13.84 | 1,872 |
| ViT-B (Top-1 acc.) | 80.29% | 40.16% |
This pattern is observed across large LLMs (OPT, LLaMA, GPT-2) and vision/audio diffusion models (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Khandelwal et al., 30 Sep 2025).
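A toy demonstration of this failure mode, using synthetic activations and an artificially injected outlier (the numbers are illustrative, not drawn from the cited models):

```python
import numpy as np

def quantize_dequantize(x, bits=8):
    # Per-tensor symmetric quantize-dequantize, for measuring reconstruction error.
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax
    q = np.clip(np.round(x / s), -qmax - 1, qmax)
    return q * s

rng = np.random.default_rng(0)
acts = rng.standard_normal(4096).astype(np.float32)   # "typical" activations

clean_err = np.abs(quantize_dequantize(acts) - acts).mean()

acts_outlier = acts.copy()
acts_outlier[0] = 1000.0                               # one extreme outlier value
out_err = np.abs(quantize_dequantize(acts_outlier) - acts_outlier)[1:].mean()

print(f"mean error without outlier: {clean_err:.4f}")
print(f"mean error with outlier:    {out_err:.4f}")   # scale ~1000/127 -> coarse grid
```

A single extreme value inflates the scale from roughly `max|x|/127 ≈ 0.03` to about `7.9`, so the remaining thousands of values land on only a few integer levels.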
3. Mitigating Outliers: Structural and Algorithmic Advances
Recent research has proposed several methods to address the outlier problem and make W8A8 per-tensor static quantization viable at scale:
- Activation-Weight Equalization (AWEQ): Introduces a channelwise rescaling so that each weight and its corresponding activation channel have similar dynamic ranges before quantization. The calibration phase computes per-channel equalization coefficients, which are folded into the respective tensors, followed by standard per-tensor static quantization. AWEQ also adds a bias correction term to further reduce quantization-induced output bias (Li et al., 2023). A sketch of this channelwise rescaling idea appears after this list.
- TWEO Loss (Transformers Without Extreme Outliers): Incorporates a loss regularizer that penalizes only the extreme tails of the activation distribution during training, suppressing outliers from >10,000 to <20 (typical threshold τ=3, exponent p=4). This ensures that all activations are well within the quantizable range, enabling lossless per-tensor W8A8 quantization without architectural change or mixed precision (Liang et al., 28 Nov 2025).
- CushionCache Prefixing (Attention Sinks): Proposes a two-stage process: (i) a greedy prefix token search minimizing downstream activation maxima, and (ii) prefix tuning optimizing both prediction and quantization loss. The cached key-value vectors from the discovered prefix are injected into each transformer layer at inference, massively reducing activation outliers and thus improving W8A8 quantization fidelity (Son et al., 17 Jun 2024).
- Channel-Flattening (FlattenQuant): Slices large channels into multiple virtual channels capped at a fixed dynamic-range threshold, dramatically tightening the per-tensor maximum. This not only allows even lower bit-widths (e.g., 4 bits) with minimal information loss but also enhances vanilla W8A8 quantization by reducing quantization error (Zhang et al., 28 Feb 2024).
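The channelwise rescaling referenced in the AWEQ bullet above can be sketched as follows. The factor formula with migration strength α follows the commonly cited SmoothQuant form; it is an illustrative stand-in under that assumption, not a reproduction of either paper's exact procedure, and all names are ours:

```python
import numpy as np

def equalize(X_calib: np.ndarray, W: np.ndarray, alpha: float = 0.5):
    """Fold per-channel factors into activations and weights before per-tensor quantization.

    X_calib: (N, C) calibration activations; W: (C_out, C) weights.
    Uses s_c = max|X_c|^alpha / max|W_c|^(1 - alpha) (SmoothQuant-style);
    AWEQ applies a related range-equalization rule plus bias correction.
    """
    act_max = np.abs(X_calib).max(axis=0) + 1e-8   # per input channel
    w_max = np.abs(W).max(axis=0) + 1e-8           # per input channel
    s = (act_max ** alpha) / (w_max ** (1 - alpha))
    X_eq = X_calib / s                             # X diag(1/s)
    W_eq = W * s                                   # diag(s) folded into W's columns
    return X_eq, W_eq, s

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64)).astype(np.float32)
X[:, 3] *= 50.0                                    # one outlier activation channel
W = rng.standard_normal((32, 64)).astype(np.float32)

X_eq, W_eq, s = equalize(X, W)
# The linear output is preserved: (X / s) @ (W * s)^T == X @ W^T
assert np.allclose(X_eq @ W_eq.T, X @ W.T, rtol=1e-3, atol=1e-2)
print("activation absmax before/after:", float(np.abs(X).max()), float(np.abs(X_eq).max()))
```

The key property is that the rescaling is mathematically a no-op on the layer output, while the activation dynamic range handed to the per-tensor quantizer shrinks substantially.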
Other works propose denoising-timestep-aware smoothing and low-rank adapters, especially in audio/vision transformers, to further address temporal/channel outliers (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024).
4. Empirical Performance and Hardware Impact
Recent large-scale studies confirm that, when outlier mitigation is employed, per-tensor static W8A8 quantization can match or nearly match full-precision model quality (typically <1–3% accuracy drop) with substantial gains:
- Accuracy: For models such as Llama-3.1-8B/70B/405B, static per-tensor W8A8-INT recovers 97–101% of Open LLM Leaderboard accuracy or within <0.5 perplexity points of BF16 on text generation benchmarks (Kurtic et al., 4 Nov 2024, Liang et al., 28 Nov 2025). Vision transformers quantized with TWEO recover >98% of top-1 ImageNet accuracy (Liang et al., 28 Nov 2025).
- Speed and Cost: Inference is accelerated by 1.2–4× and memory footprint shrinks by 2–4×, depending on model size and hardware. For example, Llama-3.1-405B latency per token drops from ~27.7s (BF16) to ~8.3s (W8A8) with a 4× cost reduction (Kurtic et al., 4 Nov 2024).
- Latency and Throughput: INT8 GEMM kernels are fully exploited with per-tensor scales; INT8 matmuls yield ~2× speedup over FP16, and cost-per-query is materially reduced in scaled deployments (Zhang et al., 28 Feb 2024, Zhao et al., 4 Jun 2024).
- Memory: Weight and activation memory are roughly halved versus BF16 (and reduced about 4× versus FP32) when INT8 is adopted for both (Liang et al., 28 Nov 2025, Zhang et al., 28 Feb 2024).
Table: Representative W8A8 Static Quantization Results
| Model | Task/Metric | BF16/FP16 | W8A8 Static | W8A8 + Outlier Mitigation | Speedup |
|---|---|---|---|---|---|
| Llama-3.1-70B | Acc. (%) | 99.9 | 98.8 | 99.9 (TWEO/Cache) | 1.5–4× |
| GPT-2 XL | Perplexity | 13.84 | 1,872 | 13.09 (TWEO) | 3–4× |
| ViT-B (87M) | Top-1 (%) | 80.29 | 40.16 | 80.29 (TWEO) | 2× |
| Audio DiT | CLAP score | 0.3009 | 0.2934 | – | 2–3× |
| PixArt-α (image gen) | FID | 73.34 | 75.61 | – | 1.47× |
5. Implementation Variants and Best Practices
Several configurations and implementation details are prevalent in recent literature:
- Symmetric Per-Tensor Quantization: Dominates for fused INT8 kernels; no per-channel scale/zero-point, maximizing hardware compatibility (Kurtic et al., 4 Nov 2024, Liang et al., 28 Nov 2025, Zhang et al., 28 Feb 2024).
- Calibration Strategy: Use small but diverse calibration sets (1–2K domain-relevant samples) to extract maxima; percentile clipping is optional and sometimes beneficial (Kurtic et al., 4 Nov 2024). A sketch of static calibration with optional percentile clipping follows this list.
- Equalization and Smoothing: Pre-quantization rescaling to balance weights and activations (AWEQ, ViDiT-Q, SmoothQuant) or flattening methods to cap outliers in a structured way (FlattenQuant) (Zhang et al., 28 Feb 2024, Li et al., 2023, Zhao et al., 4 Jun 2024).
- Inference Kernel Design: All major frameworks (vLLM, QServe, FasterTransformer) exploit TensorCore or similar INT8 GEMMs with one scale per matrix/tensor, avoiding extra runtime quantization steps (Zhang et al., 28 Feb 2024, Kurtic et al., 4 Nov 2024).
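As a concrete reference for the calibration bullet above, here is a minimal sketch of static scale estimation with optional percentile clipping; the synthetic data, the 99.9 percentile, and the function name are assumptions for illustration only:

```python
import numpy as np

def static_activation_scale(calib_batches, percentile: float = 99.9, bits: int = 8) -> float:
    """One static per-tensor activation scale from a calibration set.

    percentile < 100 clips rare extremes instead of using the raw maximum;
    this is the optional percentile-clipping variant, not a required step.
    """
    qmax = 2 ** (bits - 1) - 1
    all_abs = np.concatenate([np.abs(x).ravel() for x in calib_batches])
    clip_val = np.percentile(all_abs, percentile)
    return float(clip_val) / qmax

rng = np.random.default_rng(0)
calib = [rng.standard_normal((32, 256)).astype(np.float32) for _ in range(64)]
calib[0][0, 0] = 500.0                     # a rare outlier in the calibration data

raw_max = max(float(np.abs(c).max()) for c in calib)
print("scale from raw max:        ", raw_max / 127)
print("scale with 99.9% clipping: ", static_activation_scale(calib, percentile=99.9))
```

Clipping trades a small amount of saturation on rare extremes for a much finer grid over the bulk of the distribution; whether that trade pays off is calibration-set dependent.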
Table: Major Outlier Mitigation Techniques Compatible with Static W8A8
| Method | Approach | Primary Paper |
|---|---|---|
| TWEO-Loss | Training-time outlier suppression | (Liang et al., 28 Nov 2025) |
| CushionCache | Prefix KV cache for activations | (Son et al., 17 Jun 2024) |
| AWEQ | Channel equalization, debias | (Li et al., 2023) |
| FlattenQuant | Channel flattening for activations | (Zhang et al., 28 Feb 2024) |
| SmoothQuant | Channel balancing | (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024) |
6. Limitations and Future Directions
While W8A8 per-tensor static quantization is now effective for large-scale LLM and ViT models under outlier mitigation, some limitations remain:
- Activation Outlier Vulnerability: When outlier removal (via TWEO, CushionCache, or flattening) is not feasible, the approach remains brittle, with catastrophic degradation on complex distributions (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024).
- Fine-Grained Alternatives: Per-channel or per-group quantization, as used in some vision/audio models, can mildly improve fidelity but at the cost of increased kernel/memory complexity (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024).
- Dynamic Activation Ranges: Some audio and vision transformers now use dynamic per-token/timestep quantization for activations, which can further close the gap to full precision but relinquishes some of the efficiency advantages of static scales (Khandelwal et al., 30 Sep 2025, Zhao et al., 4 Jun 2024); the trade-off is illustrated in the sketch after this list.
- Scalability: At very long sequence lengths or for atypical architectures, even advanced mitigation may not capture all quantization-induced pathologies.
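To make the static-versus-dynamic trade-off in the dynamic-activation bullet concrete, a small sketch contrasting one fixed per-tensor scale with per-token scales computed at runtime (synthetic data; illustrative only):

```python
import numpy as np

QMAX = 127

def quantize_static(x, s):
    # One precomputed scale for the whole tensor (no runtime statistics).
    return np.clip(np.round(x / s), -128, QMAX).astype(np.int8), s

def quantize_dynamic_per_token(x):
    # One scale per row/token, computed online at inference time.
    s = np.abs(x).max(axis=-1, keepdims=True) / QMAX
    q = np.clip(np.round(x / s), -128, QMAX).astype(np.int8)
    return q, s

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 512)).astype(np.float32)
x[2] *= 40.0                                   # one token with a much larger range

q_s, s_static = quantize_static(x, float(np.abs(x).max()) / QMAX)
q_d, s_dyn = quantize_dynamic_per_token(x)

print(f"static per-tensor error:  {np.abs(q_s * s_static - x).mean():.4f}")
print(f"dynamic per-token error:  {np.abs(q_d * s_dyn - x).mean():.4f}")  # tighter, but needs online max-reduction
```

Dynamic per-token scales give lower error when token ranges vary widely, at the cost of an extra max-reduction and rescale per token that static schemes avoid.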
A plausible implication is that further research may combine hybrid techniques—training-time regularization, runtime prefixing, and hardware co-design—in pursuit of lossless precision scaling alongside maximal throughput and memory savings.
7. Practical Recommendations and Deployment Experience
For deployment on modern accelerators, the following practical guidelines are supported by large-scale empirical studies:
- Model Preparation: Apply outlier mitigation (e.g., TWEO, CushionCache) as a compulsory step for LLMs and ViTs (Liang et al., 28 Nov 2025, Son et al., 17 Jun 2024).
- Calibration: Use domain-relevant calibration data for both weights and activations; avoid overreliance on random/uniform data (Kurtic et al., 4 Nov 2024).
- Kernel Selection: Favor symmetric per-tensor quantization to maximally exploit hardware INT8 GEMM kernels (Zhang et al., 28 Feb 2024, Kurtic et al., 4 Nov 2024).
- Latency and Cost Optimization: W8A8 per-tensor static quantization is highly recommended for synchronous and asynchronous deployments on A100/H100-class GPUs, balancing minimal accuracy loss with 1.2–4× cost and memory gains (Kurtic et al., 4 Nov 2024, Zhang et al., 28 Feb 2024).
- Fallback Mechanisms: If training-time outlier suppression is infeasible, consider offline prefix cache or channel-equalization as mitigation (Son et al., 17 Jun 2024, Li et al., 2023).
References: (Zhang et al., 28 Feb 2024, Liang et al., 28 Nov 2025, Khandelwal et al., 30 Sep 2025, Son et al., 17 Jun 2024, Li et al., 2023, Kurtic et al., 4 Nov 2024, Zhao et al., 4 Jun 2024)