
FP8 Quantization: Efficacy & Performance

Updated 2 April 2026
  • The paper demonstrates that FP8 quantization using E4M3/E5M2 formats enables nearly lossless training and inference in large neural networks.
  • It details quantization schemes and dynamic scaling strategies that minimize error and deliver up to 2× memory and throughput gains.
  • Empirical benchmarks confirm that FP8 reduces the memory footprint while robustly handling outlier-heavy value distributions in various models.

Floating-point 8-bit (FP8) quantization is a low-precision arithmetic framework that leverages 8-bit floating-point formats—typically E4M3 (4 exponent, 3 mantissa bits) and E5M2 (5 exponent, 2 mantissa bits)—to reduce memory, accelerate arithmetic, and enable full-stack training and inference of large neural networks. Contemporary hardware (NVIDIA Hopper/Blackwell, Intel Gaudi2/3) supports native FP8 computation, unlocking up to 2× higher throughput and 2× lower memory use than 16-bit baselines. The efficacy of FP8, as both a training and an inference primitive, depends on the error introduced by low precision, on the quantization procedure and scaling strategy, and on their interaction with network value distributions and architectural outliers. Recent research has established FP8 as a near-lossless alternative to BF16 and FP16 in large-scale training, and as a dominant post-training quantization (PTQ) method for inference when carefully engineered.

1. FP8 Format Specifications and Error Characteristics

FP8 formats allocate the 8 bits as 1 sign, E exponent, and M mantissa bits (E+M=7), and two primary variants are widely adopted:

Format  Exponent bits  Mantissa bits  Exponent bias  Max normal value  Machine epsilon
E4M3    4              3              7              448               0.125
E5M2    5              2              15             57,344            0.25

E4M3 maximizes precision within its range at the expense of dynamic range; E5M2 extends dynamic range at the cost of mantissa precision. The normal range of E4M3 spans ≈[1.6×10⁻², 4.48×10²], suitable for NLP activations and weights with heavy outliers, while E3M4 or E2M5 offer finer resolution for computer vision models with tight value distributions (Shen et al., 2023, Zhang et al., 2023).
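The table's constants follow directly from the bit allocation. As a quick check, a short Python sketch derives them, assuming OCP-style special-value handling for E4M3 (only the all-ones mantissa in the top binade is NaN) and IEEE-style inf/NaN codes for E5M2:

```python
def fp8_params(exp_bits: int, man_bits: int, ieee_specials: bool = False) -> dict:
    """Derive FP8 format constants from the bit allocation.

    ieee_specials=False models OCP-style E4M3, where only S.1111.111 is NaN,
    so the top binade keeps most of its mantissa codes; ieee_specials=True
    models E5M2, which reserves the all-ones exponent for inf/NaN as in IEEE 754.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        max_exp = (2 ** exp_bits - 2) - bias   # top exponent code is inf/NaN
        max_mantissa = 2 - 2 ** (-man_bits)    # all mantissa codes usable
    else:
        max_exp = (2 ** exp_bits - 1) - bias   # top binade is usable
        max_mantissa = 2 - 2 ** (1 - man_bits) # all-ones mantissa is NaN
    return {
        "bias": bias,
        "max_normal": max_mantissa * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "epsilon": 2.0 ** (-man_bits),
    }
```

`fp8_params(4, 3)` reproduces the E4M3 row (bias 7, max 448, ε = 0.125, min normal 2⁻⁶ ≈ 1.6×10⁻²), and `fp8_params(5, 2, ieee_specials=True)` the E5M2 row.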

Quantization in FP8 is nonuniform: values are concentrated near zero, with exponentially spaced bins for large values, which is advantageous when modeling activations or weights with long tails, common in transformer architectures. The worst-case relative quantization error for normal numbers is bounded by ±2^−(M+1) (e.g., 6.25% for E4M3, 12.5% for E5M2).
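A minimal round-to-nearest E4M3 simulator (a sketch for illustration, not any particular library's kernel) makes the binade structure and the ±2^−(M+1) bound concrete:

```python
import math

def q_e4m3(x: float) -> float:
    """Round x to the nearest simulated E4M3 value, saturating at ±448."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    a = min(abs(x), 448.0)                 # saturate (E4M3 has no infinity)
    e = max(math.floor(math.log2(a)), -6)  # binade; clamp into subnormal range
    step = 2.0 ** (e - 3)                  # bin width 2^(e - M) doubles per binade
    return sign * min(round(a / step) * step, 448.0)

# Normal-range inputs respect the worst-case relative error bound 2^-(M+1) = 6.25%.
for v in (0.07, 0.5, 3.3, 100.0, 440.0):
    assert abs(q_e4m3(v) - v) / v <= 2 ** -4
```

Because the bin width 2^(e−3) grows with the binade, absolute error grows with magnitude while relative error stays bounded—the property that makes FP8 tolerant of long-tailed activation distributions.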

Under post-training quantization, the empirical mean squared error (MSE) of FP8 is lower than that of INT8 in the presence of outliers, as the exponential bins can represent extreme values without large clipping error (Kuzmin et al., 2022, Shen et al., 2023).

2. Quantization Schemes and Scaling Strategies

The efficacy of FP8 quantization is highly dependent on scaling granularity (per-layer, per-channel, per-group, per-token) and dynamic vs. static calibration:

For optimizer states and Adam moments, advanced designs such as "Dynamic Range Expansion" (DRE) precondition the value distributions to maximize FP8 bin utilization, reducing quantization error and enabling both first and second moments to be stored in FP8 without instability (Xi et al., 2024, Fishman et al., 2024). In COAT, a nonlinear map expands each group's dynamic range and rescales such that the effective range matches the FP8 representable set, reducing the update-ratio MSE by up to 1.63× over naïve quantization.
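As an illustration of the range-expansion idea—not COAT's exact formulation (the expansion function, group handling, and constants below are assumptions for this sketch)—a per-group power map sign(x)·|x|^k can stretch a narrow distribution across the full E4M3 normal range and be inverted after dequantization:

```python
import math

E4M3_MAX, E4M3_MIN_NORMAL = 448.0, 2.0 ** -6

def expand_group(xs):
    """Pick k so f(x) = sign(x)*|x|^k stretches the group's dynamic range
    (amax/amin) onto the full E4M3 normal range, then rescale to [-448, 448]."""
    mags = [abs(v) for v in xs if v != 0.0]
    amax, amin = max(mags), min(mags)
    target = math.log(E4M3_MAX / E4M3_MIN_NORMAL)
    k = target / math.log(amax / amin) if amax > amin else 1.0
    expanded = [math.copysign(abs(v) ** k, v) for v in xs]
    scale = E4M3_MAX / max(abs(v) for v in expanded)
    return [v * scale for v in expanded], k, scale

def contract_group(qs, k, scale):
    """Invert the expansion after (de)quantization: sign(y)*|y/scale|^(1/k)."""
    return [math.copysign((abs(v) / scale) ** (1.0 / k), v) for v in qs]
```

For a tightly clustered group such as `[0.5, 0.61, -0.55]`, the fitted k is far above 1, so neighboring values land in different FP8 binades instead of collapsing into one bin; the map is exactly invertible up to floating-point error.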

Activation quantization employs mixed-granularity: per-tensor static scaling in linear layers for kernel efficiency, and per-group scaling in nonlinearities to address the higher quantization error from activation outliers in deep transformers (Xi et al., 2024, Wang et al., 4 Nov 2025).
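A toy comparison (simulated E4M3 on hypothetical data, not any specific kernel) shows why finer granularity helps when one channel dwarfs another:

```python
import math

def q_e4m3(x: float) -> float:
    """Round to the nearest simulated E4M3 value, saturating at ±448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), 448.0)
    e = max(math.floor(math.log2(a)), -6)
    q = round(a / 2.0 ** (e - 3)) * 2.0 ** (e - 3)
    return math.copysign(min(q, 448.0), x)

def rel_mse(channels, per_channel: bool) -> float:
    """Mean squared relative error of quantize->dequantize under amax scaling."""
    if per_channel:
        scales = [448.0 / max(abs(v) for v in ch) for ch in channels]
    else:
        amax = max(abs(v) for ch in channels for v in ch)
        scales = [448.0 / amax] * len(channels)  # one scale for the whole tensor
    errs = []
    for ch, s in zip(channels, scales):
        for v in ch:
            deq = q_e4m3(v * s) / s
            errs.append(((v - deq) / v) ** 2)
    return sum(errs) / len(errs)

# A small-magnitude channel next to an outlier-heavy one.
acts = [[0.0011, -0.0007, 0.0004, 0.0009], [210.0, -95.0, 33.0, 7.5]]
```

With a single per-tensor scale, the outlier channel's amax crushes the small channel into the subnormal region (one value even rounds to zero); per-channel scales let each channel use the full FP8 range, cutting relative error by orders of magnitude.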

3. Empirical Efficacy and Benchmark Evaluation

Across large-scale LLMs (including OLMo, Llama-2/3, DeepSeek, and Bloom), vision-language models (VILA), and extensive CV/NLP PTQ tasks, recent FP8 stacks report end-to-end model accuracy and convergence indistinguishable from BF16/FP16 baselines:

Model/Task                      Baseline (BF16/FP32)  FP8 Result   Deviation
Llama 70B, Academic Benchmarks  84.40–86.79           84.16–86.89  <0.2% (Kurtic et al., 2024)
OLMo-7B, Pretraining PPL        (BF16)                (FP8)        ≤0.1 PPL (Xi et al., 2024)
BERT-Base/Large, GLUE           84–86 (MNLI)          84–85 (FP8)  ≤0.2 (Li et al., 2023)
Vision Transformers, ImageNet   81.3                  81.3–81.4    ≈0.1 (Liang et al., 28 Nov 2025)

Post-training quantization with FP8 (E4M3/E5M2), when combined with per-channel/per-group scaling and proper range calibration, achieves ≤1% accuracy drop on 92.6% of NLP/CV workloads, outstripping INT8 (65.9% pass rate) (Shen et al., 2023). In LLMs and transformers with heavy-tailed distributions, FP8 quantization is effectively lossless in both weight-activation (W8A8) and hybrid scenarios (Kurtic et al., 2024, Zhang et al., 2023, Li et al., 2023).
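The outlier advantage can be reproduced in a toy experiment (simulated formats, synthetic data—the distributions and outlier values below are assumptions, not the benchmark workloads): on a Gaussian bulk contaminated with a few large outliers, FP8's exponential bins beat a symmetric INT8 grid whose step size is dictated by the outliers.

```python
import math, random

def q_e4m3(x: float, scale: float) -> float:
    """Quantize->dequantize through simulated E4M3 with an amax-derived scale."""
    y = x * scale
    if y == 0.0:
        return 0.0
    a = min(abs(y), 448.0)
    e = max(math.floor(math.log2(a)), -6)
    q = round(a / 2.0 ** (e - 3)) * 2.0 ** (e - 3)
    return math.copysign(min(q, 448.0), y) / scale

def q_int8(x: float, scale: float) -> float:
    """Quantize->dequantize through symmetric INT8."""
    return max(-127, min(127, round(x * scale))) / scale

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)] + [50.0, -80.0, 120.0]
amax = max(abs(v) for v in data)  # outlier-dominated: 120

mse_fp8 = sum((v - q_e4m3(v, 448.0 / amax)) ** 2 for v in data) / len(data)
mse_int8 = sum((v - q_int8(v, 127.0 / amax)) ** 2 for v in data) / len(data)
```

The INT8 grid spreads its 255 levels uniformly up to the outlier magnitude, leaving the Gaussian bulk with a step of roughly 0.94, while FP8 keeps its finest bins exactly where the mass is.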

Crucially, for activation distributions with extreme outliers (common in transformers), naive FP8 can diverge or massively degrade performance. Techniques such as TWEO loss regularization (mechanically suppressing heavy tails) (Liang et al., 28 Nov 2025) and architectural modifications (Smooth-SwiGLU (Fishman et al., 2024)) restore stability, enabling full-model FP8 training with performance on par with or exceeding BF16.

4. Computational and Memory Efficiency Gains

FP8 reduces the memory footprint for weights, activations, optimizer states, and KV caches by up to 2×, enabling entire large models (e.g., Llama-2-7B full-parameter) to fit on a single 80GB H100—where BF16 would otherwise OOM (Xi et al., 2024). COAT achieves up to a 1.55× reduction in peak memory and a 1.43× end-to-end training speedup versus BF16, with the gains matching or exceeding NVIDIA TransformerEngine (Xi et al., 2024). FireQ reports inference speedups of up to 1.68× over W4A8-INT in FFN throughput for Llama2 (2505.20839), while SnapMLA achieves 1.91× decoding throughput in long-context MLA tasks (Zhang et al., 11 Feb 2026).

On hardware with high MFU for FP8 (e.g., Gaudi2), end-to-end throughput gains approach 2× for large matrices, with dynamic scaling and block-wise access required to maintain accuracy (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025). RL rollouts with FP8 W8A8 and compressed KV caches demonstrate up to 44% throughput gain in long-context autoregressive generation, with learning curves tracking BF16 once token-level importance-sampling correction is used (Qiu et al., 26 Jan 2026).

5. Robustness, Security, and Limitations

FP8 quantization offers increased resistance to parameter fault injection attacks such as "gradient-guided bit-flip jailbreaks" in aligned LLMs: empirical attack success rates of <15% at 25 bit-flips (versus >80% for FP16), and <65% at 150 flips, outperforming INT8 and INT4 in resilience (Zahran et al., 4 Jul 2025). However, transferred attacks from higher-precision quantization (FP16→FP8) are not erased, and comprehensive hardware-level protections remain necessary.

Training stability is sensitive to activation outliers; extended training (trillion tokens) can reveal catastrophic instability in standard FP8 unless mitigations like Smooth-SwiGLU are applied (Fishman et al., 2024). Deployment for inference on edge devices is generally not recommended due to the higher compute overhead of FP8 arithmetic versus INT8; dedicated INT8 accelerators (area, power, and latency) remain preferable for hardware-constrained settings (Baalen et al., 2023).

6. Comparative Assessment with Alternative Quantization

FP8 outperforms INT8 in scenarios where input distributions have large outliers or long tails, while INT8 remains competitive—or slightly superior—for low-outlier, tightly centered data (e.g., vision models) (Zhang et al., 2023, Baalen et al., 2023, Shen et al., 2023). Mixed-format PTQ—dynamically selecting INT8/FP8 per layer via MSE minimization—achieves state-of-the-art results with no hardware overhead, shrinking the accuracy gap in hybrid tasks (Zhang et al., 2023).
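The per-layer selection rule is simple to sketch (simulated formats and made-up layer data; real selectors operate on calibration activations): quantize each layer both ways and keep whichever format minimizes reconstruction MSE.

```python
import math

def deq_e4m3(x: float, scale: float) -> float:
    """x -> simulated E4M3 -> back, under an amax-derived scale."""
    y = x * scale
    if y == 0.0:
        return 0.0
    a = min(abs(y), 448.0)
    e = max(math.floor(math.log2(a)), -6)
    q = round(a / 2.0 ** (e - 3)) * 2.0 ** (e - 3)
    return math.copysign(min(q, 448.0), y) / scale

def deq_int8(x: float, scale: float) -> float:
    """x -> symmetric INT8 -> back."""
    return max(-127, min(127, round(x * scale))) / scale

def choose_format(layer):
    """Pick the per-layer format with the lower quantize->dequantize MSE."""
    amax = max(abs(v) for v in layer)
    mses = {
        "e4m3": sum((v - deq_e4m3(v, 448.0 / amax)) ** 2 for v in layer),
        "int8": sum((v - deq_int8(v, 127.0 / amax)) ** 2 for v in layer),
    }
    return min(mses, key=mses.get)

tight = [i / 100.0 - 0.5 for i in range(1, 100)]            # tightly centered
heavy = [i / 100.0 - 0.5 for i in range(1, 100)] + [120.0]  # one large outlier
```

On the tightly centered layer INT8's uniform grid wins; adding a single outlier flips the decision to E4M3, matching the qualitative picture above.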

Under quantization-aware training (QAT), differences among low-bit formats diminish; all converge to within 0.3% of FP32 accuracy (Kuzmin et al., 2022). The principal gain of FP8 is therefore in post-hoc quantization and efficient full-stack training/inference where retraining is infeasible or architectures are maximally memory/compute bound.

7. Practical Implementation and Deployment Recommendations

In summary, FP8 quantization, when paired with principled scaling, groupwise expansion, and architectural-aware mitigation of outliers, enables nearly lossless training and inference across diverse domains, significantly advances memory- and compute-efficiency, and generalizes to reinforcement learning and long-context attention with minimal impact on empirical performance (Xi et al., 2024, Wang et al., 26 Sep 2025, Kurtic et al., 2024, Liang et al., 28 Nov 2025, Lee et al., 13 Mar 2025, Zhang et al., 11 Feb 2026, Qiu et al., 26 Jan 2026). FP8's efficacy is thus firmly established for modern large-model workloads in both training and production.
