FP8 Quantization: Efficacy & Performance
- The paper demonstrates that FP8 quantization using E4M3/E5M2 formats enables nearly lossless training and inference in large neural networks.
- It details quantization schemes and dynamic scaling strategies that minimize error and deliver up to 2× memory and throughput gains.
- Empirical benchmarks confirm that FP8 reduces the memory footprint while robustly handling outlier-heavy value distributions in various models.
Floating-point 8-bit (FP8) quantization is a low-precision arithmetic framework that leverages 8-bit floating-point formats—typically E4M3 (4 exponent, 3 mantissa bits) and E5M2 (5 exponent, 2 mantissa bits)—to reduce memory, accelerate arithmetic, and enable full-stack training and inference of large neural networks. Contemporary hardware (NVIDIA Hopper/Blackwell, Intel Gaudi2/3) supports native FP8 computation, unlocking the potential for up to 2× improvements in throughput and 2× reductions in memory over 16-bit baselines. The efficacy of FP8, both as a training and an inference primitive, is governed by the error introduced by low precision, by the quantization procedure and scaling strategy, and by their interaction with network value distributions and architectural outliers. Recent research has established FP8 as a near-lossless alternative to BF16 and FP16 in large-scale training, and a dominant post-training quantization (PTQ) method for inference when carefully engineered.
1. FP8 Format Specifications and Error Characteristics
FP8 formats allocate the 8 bits as 1 sign, E exponent, and M mantissa bits (E+M=7), and two primary variants are widely adopted:
| Format | Exponent bits | Mantissa bits | Exponent bias | Max normal value | Machine epsilon |
|---|---|---|---|---|---|
| E4M3 | 4 | 3 | 7 | 448 | 0.125 |
| E5M2 | 5 | 2 | 15 | 57,344 | 0.25 |
E4M3 maximizes in-range representability at the expense of dynamic range; E5M2 extends dynamic range but at the cost of mantissa precision. The representable range of E4M3 spans ≈[1.6×10⁻², 4.48×10²], suitable for NLP activations and weights with heavy outliers, while E3M4 or E2M5 offer finer resolution for computer vision models with tight value distributions (Shen et al., 2023, Zhang et al., 2023).
Quantization in FP8 is nonuniform: values are concentrated near zero with exponentially spaced bins for large values, which is advantageous when modeling activations or weights with long tails, common in transformer architectures. The worst-case relative quantization error for normal numbers is bounded by ±2^{−(M+1)} (e.g., 6.25% for E4M3, 12.5% for E5M2).
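The table's format parameters can be reproduced directly from the bit layouts. A minimal sketch, assuming the widely used "FP8 for deep learning" conventions (E4M3 gives up infinities and reserves a single NaN pattern so its top exponent remains usable, while E5M2 keeps IEEE-style specials); the helper name `fp8_stats` is illustrative:

```python
def fp8_stats(exp_bits, man_bits, ieee_specials):
    """Derive bias, max normal value, machine epsilon, and worst-case
    round-to-nearest relative error for an FP8 format."""
    bias = 2 ** (exp_bits - 1) - 1
    # IEEE-style formats (E5M2) reserve the top exponent for inf/NaN.
    max_biased_exp = 2 ** exp_bits - 1 - (1 if ieee_specials else 0)
    # E4M3 reserves only mantissa=all-ones at the top exponent for NaN,
    # so its largest mantissa sits one step below all-ones.
    max_mantissa = 2 - 2 ** -man_bits - (0 if ieee_specials else 2 ** -man_bits)
    max_normal = max_mantissa * 2 ** (max_biased_exp - bias)
    eps = 2 ** -man_bits                 # spacing of representable values at 1.0
    max_rel_err = 2 ** -(man_bits + 1)   # half a spacing, relative, for normals
    return bias, max_normal, eps, max_rel_err

print("E4M3:", fp8_stats(4, 3, ieee_specials=False))  # (7, 448.0, 0.125, 0.0625)
print("E5M2:", fp8_stats(5, 2, ieee_specials=True))   # (15, 57344.0, 0.25, 0.125)
```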
Under post-training quantization, the empirical mean squared error (MSE) of FP8 is lower than that of INT8 in the presence of outliers, because the exponentially spaced bins can represent extreme values without incurring large clipping error (Kuzmin et al., 2022, Shen et al., 2023).
2. Quantization Schemes and Scaling Strategies
The efficacy of FP8 quantization depends strongly on scaling granularity (per-layer, per-channel, per-group, per-token) and on dynamic versus static calibration; a minimal sketch of the main granularities follows the list below:
- Per-tensor/per-layer scaling: Maximizes throughput (hardware-accelerated power-of-two scaling, >98% MFU on Gaudi2); appropriate for linear layers (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025).
- Per-channel/group scaling: Reduces quantization noise, critical for transformer MLPs and outlier-prone layers, with a modest throughput cost (Shen et al., 2023, Zhang et al., 2023).
- Block-wise/token-wise scaling: Used in hybrid-granularity strategies to optimize for both efficiency and numerical fidelity, especially in activation quantization (Wang et al., 26 Sep 2025).
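The following sketch contrasts per-tensor and per-channel scaling on a toy weight matrix, simulating E4M3 rounding in NumPy; the scale convention (absolute maximum mapped to 448) and the helper names are illustrative assumptions, not a specific framework's API:

```python
import numpy as np

E4M3_MAX = 448.0   # largest normal E4M3 magnitude
E4M3_MIN_EXP = -6  # smallest normal exponent

def quantize_e4m3(x, scale):
    """Simulate round-to-nearest E4M3 quantization of x/scale (3 mantissa bits)."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(y), 2.0 ** E4M3_MIN_EXP)))
    step = 2.0 ** (exp - 3)              # bin width within the current binade
    return np.round(y / step) * step

def per_tensor_scale(w):
    """One scale for the whole tensor: cheapest, but dominated by the global max."""
    return np.abs(w).max() / E4M3_MAX

def per_channel_scale(w):
    """One scale per output channel (row): isolates outlier-heavy channels."""
    return np.abs(w).max(axis=1, keepdims=True) / E4M3_MAX

w = np.random.randn(8, 256)
w[0] *= 50.0  # one outlier-heavy channel, as often seen in transformer layers

for name, s in [("per-tensor", per_tensor_scale(w)), ("per-channel", per_channel_scale(w))]:
    err = w - quantize_e4m3(w, s) * s
    print(f"{name:12s} MSE: {np.mean(err ** 2):.6f}")
```

On such outlier-heavy matrices the per-channel variant typically yields a substantially lower MSE, which is the effect the per-channel/per-group schemes above exploit at a modest throughput cost.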
For optimizer states and Adam moments, advanced designs such as "Dynamic Range Expansion" (DRE) precondition the value distributions to maximize FP8 bin utilization, reducing quantization error and enabling both first and second moments to be stored in FP8 without instability (Xi et al., 2024, Fishman et al., 2024). In COAT, a nonlinear map expands each group's dynamic range and rescales such that the effective range matches the FP8 representable set, reducing the update-ratio MSE by up to 1.63× over naïve quantization.
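As a rough illustration of the dynamic-range-expansion idea, the sketch below applies a per-group power map sign(x)·|x|^k that stretches the group's max/min ratio toward the FP8 normal range before quantization and inverts it afterwards; the exact COAT mapping and its rule for choosing k may differ, and the function names are hypothetical:

```python
import numpy as np

E4M3_RANGE = 448.0 / 2.0 ** -6   # ratio of largest to smallest normal E4M3 magnitude

def expand_group(x, target_ratio=E4M3_RANGE):
    """Choose k so that (max|x| / min|x|)^k ≈ target_ratio, then apply sign(x)*|x|^k."""
    a = np.abs(x[x != 0])
    k = np.log(target_ratio) / np.log(a.max() / a.min())
    return np.sign(x) * np.abs(x) ** k, k

def contract_group(y, k):
    """Inverse map, applied after dequantizing the FP8 values."""
    return np.sign(y) * np.abs(y) ** (1.0 / k)

# Narrow-range group, e.g. Adam second moments clustered around one magnitude.
g = 1.0 + 1e-3 * np.random.randn(128)
g_exp, k = expand_group(g)
scale = np.abs(g_exp).max() / 448.0   # ordinary per-group scale applied after expansion
print(f"k = {k:.1f}, expanded dynamic range = {np.abs(g_exp).max() / np.abs(g_exp).min():.0f}")
```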
Activation quantization employs mixed-granularity: per-tensor static scaling in linear layers for kernel efficiency, and per-group scaling in nonlinearities to address the higher quantization error from activation outliers in deep transformers (Xi et al., 2024, Wang et al., 4 Nov 2025).
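For activations, per-group scales are computed along the hidden dimension of each token; a small sketch assuming a (tokens × hidden) layout and a group size of 128 (both illustrative):

```python
import numpy as np

E4M3_MAX = 448.0

def per_group_activation_scales(act, group_size=128):
    """One scale per (token, channel-group): tracks local ranges inside nonlinearities."""
    tokens, hidden = act.shape
    groups = act.reshape(tokens, hidden // group_size, group_size)
    return np.abs(groups).max(axis=-1) / E4M3_MAX   # shape: (tokens, hidden // group_size)

def static_per_tensor_scale(calibration_acts):
    """Single static scale calibrated offline; used for linear-layer inputs for kernel efficiency."""
    return np.abs(calibration_acts).max() / E4M3_MAX

x = np.random.randn(16, 512)
x[:, -1] *= 80.0   # a persistent outlier channel
print(per_group_activation_scales(x).shape, static_per_tensor_scale(x))
```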
3. Empirical Efficacy and Benchmark Evaluation
Across large-scale LLMs (including OLMo, Llama-2/3, DeepSeek, and Bloom), vision-language models (VILA), and a broad range of CV/NLP post-training-quantization tasks, recent FP8 stacks report end-to-end model accuracy and convergence indistinguishable from BF16/FP16 baselines:
| Model/Task | Baseline (BF16/FP32) | FP8 Result | Deviation |
|---|---|---|---|
| Llama 70B, Academic Benchmarks | 84.40–86.79 | 84.16–86.89 | <0.2% (Kurtic et al., 2024) |
| OLMo-7B, Pretraining PPL | (BF16) | (FP8) | ≤0.1 PPL (Xi et al., 2024) |
| BERT-Base/Large, GLUE | 84–86 (MNLI) | 84–85 (FP8) | ≤0.2 (Li et al., 2023) |
| Vision Transformers, ImageNet | 81.3 | 81.3–81.4 | ≈0.1 (Liang et al., 28 Nov 2025) |
Post-training quantization with FP8 (E4M3/E5M2), when combined with per-channel/per-group scaling and proper range calibration, achieves ≤1% accuracy drop on 92.6% of NLP/CV workloads, outstripping INT8 (65.9% pass rate) (Shen et al., 2023). In LLMs and transformers with heavy-tailed distributions, FP8 quantization is effectively lossless in both weight-activation (W8A8) and hybrid scenarios (Kurtic et al., 2024, Zhang et al., 2023, Li et al., 2023).
Crucially, for activation distributions with extreme outliers (common in transformers), naive FP8 can diverge or massively degrade performance. Techniques such as TWEO loss regularization (mechanically suppressing heavy tails) (Liang et al., 28 Nov 2025) and architectural modifications (Smooth-SwiGLU (Fishman et al., 2024)) restore stability, enabling full-model FP8 training with performance on par with or exceeding BF16.
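A hypothetical sketch of an outlier-suppression auxiliary loss in the spirit of such regularization (the actual TWEO formulation in Liang et al. may differ; `tau` and `weight` are illustrative hyperparameters):

```python
import torch

def outlier_penalty(hidden_states, tau=100.0, weight=1e-4):
    """Penalize activation magnitudes beyond tau so hidden states stay
    comfortably inside the FP8 representable range during training."""
    excess = torch.relu(hidden_states.abs() - tau)
    return weight * (excess ** 2).mean()

# Added to the task loss each step, e.g.: loss = ce_loss + outlier_penalty(h)
```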
4. Computational and Memory Efficiency Gains
FP8 reduces the memory footprint for weights, activations, optimizer states, and KV caches by up to 2×, enabling entire large models (e.g., Llama-2-7B full-parameter) to fit on a single 80GB H100—where BF16 would otherwise OOM (Xi et al., 2024). COAT achieves up to a 1.55× reduction in peak memory and a 1.43× end-to-end training speedup versus BF16, with the gains matching or exceeding NVIDIA TransformerEngine (Xi et al., 2024). FireQ reports inference speedups of up to 1.68× over W4A8-INT in FFN throughput for Llama2 (2505.20839), while SnapMLA achieves 1.91× decoding throughput in long-context MLA tasks (Zhang et al., 11 Feb 2026).
On hardware with high MFU for FP8 (e.g., Gaudi2), end-to-end throughput gains approach 2× for large matrices, with dynamic scaling and block-wise access required to maintain accuracy (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025). RL rollouts with FP8 W8A8 and compressed KV caches demonstrate up to 44% throughput gain in long-context autoregressive generation, with learning curves tracking BF16 once token-level importance-sampling correction is used (Qiu et al., 26 Jan 2026).
5. Robustness, Security, and Limitations
FP8 quantization offers increased resistance to parameter fault-injection attacks such as "gradient-guided bit-flip jailbreaks" in aligned LLMs: empirical attack success rates of <15% at 25 bit-flips (versus >80% for FP16), and <65% at 150 flips, outperforming INT8 and INT4 in resilience (Zahran et al., 4 Jul 2025). However, attacks crafted against higher-precision models can still transfer (FP16→FP8), so comprehensive hardware-level protections remain necessary.
Training stability is sensitive to activation outliers; extended training (trillion tokens) can reveal catastrophic instability in standard FP8 unless mitigations like Smooth-SwiGLU are applied (Fishman et al., 2024). Deployment for inference on edge devices is generally not recommended due to the higher compute overhead of FP8 arithmetic versus INT8; dedicated INT8 accelerators (area, power, and latency) remain preferable for hardware-constrained settings (Baalen et al., 2023).
6. Comparative Assessment with Alternative Quantization
FP8 outperforms INT8 in scenarios where input distributions have large outliers or long tails, while INT8 remains competitive—or slightly superior—for low-outlier, tightly centered data (e.g., vision models) (Zhang et al., 2023, Baalen et al., 2023, Shen et al., 2023). Mixed-format PTQ—dynamically selecting INT8/FP8 per layer via MSE minimization—achieves state-of-the-art results with no hardware overhead, shrinking the accuracy gap in hybrid tasks (Zhang et al., 2023).
Under quantization-aware training (QAT), differences among low-bit formats diminish; all converge to within 0.3% of FP32 accuracy (Kuzmin et al., 2022). The principal gain of FP8 is therefore in post-hoc quantization and efficient full-stack training/inference where retraining is infeasible or architectures are maximally memory/compute bound.
7. Practical Implementation and Deployment Recommendations
- Training: Apply hybrid-granularity quantization: per-block scaling for weights and per-token/per-group scaling for activations, with dynamic range expansion for optimizer-state quantization (Xi et al., 2024, Wang et al., 26 Sep 2025). For models with SwiGLU or similar nonlinearities, adopt Smooth-SwiGLU or activation regularization (Fishman et al., 2024, Liang et al., 28 Nov 2025). A condensed recipe sketch follows this list.
- Inference: Use hardware-native FP8 (E4M3/E5M2) for all linear layers, per-channel scaling for weights, per-token/tensor scaling for activations (Baalen et al., 2023, Lee et al., 13 Mar 2025). For transformers and LLMs, preserve sensitive head layers in BF16 if necessary (Kim et al., 3 Feb 2025, Kurtic et al., 2024).
- Frameworks: Integrate direct FP8 dataflows with minimal dequantize/requantize operations, using casting-aware transpose and fused operators to prevent double quantization errors (Wang et al., 4 Nov 2025).
- Hardware: Favor accelerators that support full FP8 computation and FP32 accumulation (e.g., Hopper, Gaudi2+) for maximum efficiency (Kim et al., 3 Feb 2025, Lee et al., 13 Mar 2025, Kurtic et al., 2024).
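The recommendations above can be condensed into a single configuration sketch; the dictionary keys and values are illustrative shorthand for the cited techniques, not a specific framework's options:

```python
# Hypothetical FP8 deployment recipe distilled from the recommendations above.
FP8_RECIPE = {
    "training": {
        "weights":          {"scaling": "per-block"},
        "activations":      {"scaling": "per-token / per-group"},
        "optimizer_states": {"storage": "FP8", "preprocess": "dynamic-range expansion"},
        "swiglu_mitigation": "Smooth-SwiGLU or activation regularization",
    },
    "inference": {
        "linear_layers": {"formats": "E4M3/E5M2", "weights": "per-channel scaling",
                          "activations": "per-token or per-tensor scaling"},
        "sensitive_layers": "keep in BF16 if accuracy degrades",
    },
    "framework": ["direct FP8 dataflow", "casting-aware transpose", "fused operators"],
    "hardware":  ["native FP8 compute", "FP32 accumulation (Hopper, Gaudi2+)"],
}
```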
In summary, FP8 quantization, when paired with principled scaling, groupwise expansion, and architectural-aware mitigation of outliers, enables nearly lossless training and inference across diverse domains, significantly advances memory- and compute-efficiency, and generalizes to reinforcement learning and long-context attention with minimal impact on empirical performance (Xi et al., 2024, Wang et al., 26 Sep 2025, Kurtic et al., 2024, Liang et al., 28 Nov 2025, Lee et al., 13 Mar 2025, Zhang et al., 11 Feb 2026, Qiu et al., 26 Jan 2026). FP8's efficacy is thus firmly established for modern large-model workloads in both training and production.