FP8 Quantization in Deep Neural Networks
- FP8 Quantization is a numerical method using 8-bit floating-point formats that allocates bits to sign, exponent, and mantissa to balance dynamic range and precision.
- It offers configurable layouts (e.g., E5M2, E4M3, E3M4) whose trade-off between exponent range and mantissa precision can be matched to the value distributions of different DNN layers.
- Adaptive calibration and hybrid quantization strategies enable near-lossless accuracy, reduced memory footprint, and improved throughput in training and inference.
FP8 quantization is a numerical representation and quantization methodology employing 8-bit floating-point formats for efficient deep neural network (DNN) training and inference. By allocating a small number of bits among sign, exponent, and mantissa, FP8 enables a non-uniform grid of representable values, providing greater dynamic range than fixed-point integer formats of the same bitwidth and offering improved robustness to the heavy-tailed and outlier-prone data distributions common in large modern models. FP8 is seeing broad adoption in hardware (e.g., NVIDIA Hopper, Intel Gaudi 2) and in software training and inference pipelines, and is the subject of significant research into low-precision learning, datacenter cost optimization, fault resilience, and mixed-precision strategies across a range of application domains.
1. FP8 Format Structure and Rationale
FP8 denotes a family of 8-bit floating-point formats, each defined by the allocation of its bits to sign, exponent, and mantissa. Common layouts include E5M2 (1 sign, 5 exponent, 2 mantissa bits), E4M3 (1 sign, 4 exponent, 3 mantissa bits), and E3M4 (1 sign, 3 exponent, 4 mantissa bits) (Micikevicius et al., 2022, Shen et al., 2023). For normal encodings, the represented value can be formalized as $x = (-1)^{s}\, 2^{\,e-b} \left(1 + \sum_{i=1}^{m} d_i\, 2^{-i}\right)$, where $s$ is the sign bit, $e$ is the encoded exponent, $b$ is the exponent bias, $m$ the number of mantissa bits, and $d_i$ the $i$-th mantissa bit; subnormal encodings ($e = 0$) drop the implicit leading 1 and use the fixed exponent $1 - b$ (Kuzmin et al., 2022).
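For concreteness, the following sketch decodes an 8-bit pattern under a generic EeMm layout with IEEE-754-style normal/subnormal handling. It is purely illustrative (the function names are ours, not any library's); the default bias convention and the omission of Inf/NaN handling are simplifying assumptions, and published FP8 variants differ in exactly these details.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias=None) -> float:
    """Decode an 8-bit pattern under a generic 1-sign/EeMm layout (IEEE-754 style).

    Special encodings (Inf/NaN) are ignored here; real FP8 variants such as
    hardware E4M3 repurpose part of the top exponent and keep only a NaN pattern.
    """
    assert 1 + exp_bits + man_bits == 8
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1          # standard IEEE-style bias

    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)

    if exp_field == 0:                           # subnormal: no implicit leading 1
        return sign * (man_field / 2 ** man_bits) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / 2 ** man_bits) * 2.0 ** (exp_field - bias)


# 0b0_1111_110 decodes to 448.0 under E4M3 (bias 7), the commonly cited E4M3 maximum.
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3))   # 448.0
# 0b0_00001_00 decodes to the smallest normal E5M2 value, 2**-14.
print(decode_fp8(0b0_00001_00, exp_bits=5, man_bits=2))   # 6.103515625e-05
```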
Format selection trades dynamic range against representable precision within each exponent interval: more exponent bits enlarge range, favoring the retention of outlier values; more mantissa bits refine grid density, reducing quantization error for values near zero. Hardware implementations may relax strict IEEE-754 semantics (e.g., E4M3 reuses encoding slots for dynamic range rather than infinities), optimizing utility for DNN workloads where NaN/Inf support is less critical (Micikevicius et al., 2022).
Subnormals and flexible biasing schemes further extend the representable range near zero, which is crucial for distributions dominated by small-magnitude values (Zhang et al., 2023, Aggarwal et al., 2023).
| Format | Exponent bits (E) | Mantissa bits (M) | Max magnitude | Subnormals | Non-uniformity | Typical use |
|---|---|---|---|---|---|---|
| E5M2 | 5 | 2 | Large | Yes | Medium | Gradients |
| E4M3 | 4 | 3 | Moderate | Yes | Higher | Activations |
| E3M4 | 3 | 4 | Small | Yes | Highest | Weights (vision) |
2. Analytical Properties and Error Dynamics
FP8 quantization errors are fundamentally distinct from those of INT8. The quantization grid of FP8 is exponentially spaced, enabling the representation of extremely large and small values with lower risk of overflow and underflow than INT8, which maintains a uniform grid across its entire range (Kuzmin et al., 2022, Shen et al., 2023).
Mean squared error (MSE) analysis demonstrates that FP8 can achieve lower reconstruction error for heavy-tailed or outlier-rich distributions (e.g., transformer activations, vision model features) by allocating more bits to the exponent field (Kuzmin et al., 2022):
- For Gaussian or otherwise “well-behaved” weight distributions, mantissa-heavy configurations (e.g., E2M5, with 2 exponent and 5 mantissa bits) perform best.
- For layers with significant outliers, allocating more exponent bits (e.g., E3M4 or E4M3) yields the best results by reducing hard clipping (Kuzmin et al., 2022, Shen et al., 2023). The optimal configuration is therefore highly dependent on the distributional characteristics of each layer or tensor.
FP8’s non-uniform quantization acts as a form of “implicit outlier smoothing,” giving it a distinct advantage in DNN activations, LLM feedforward/attention outputs, and vision models with kurtotic feature statistics. This benefit is less pronounced for tensors with tightly concentrated distributions, where finer uniform quantization can suffice (Zhang et al., 2023).
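A minimal, self-contained illustration of this effect is sketched below. It assumes an IEEE-style E4M3 grid with special values ignored and simple per-tensor max-abs scaling (not the exact setups of the cited papers), and compares quantization MSE on Gaussian versus heavy-tailed samples.

```python
import numpy as np

def fp8_grid(exp_bits=4, man_bits=3):
    """All finite values of a generic IEEE-style EeMm format (specials ignored)."""
    bias = 2 ** (exp_bits - 1) - 1
    mags = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:                                   # subnormals
                mags.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:                                        # normals
                mags.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    mags = np.asarray(mags)
    return np.concatenate([-mags, mags])

def nearest(x, grid):
    """Round each element of x to the nearest grid point."""
    return grid[np.abs(x[:, None] - grid[None, :]).argmin(axis=1)]

rng = np.random.default_rng(0)
grid = fp8_grid()
grid = grid / np.abs(grid).max()                         # normalize so max magnitude = 1

for name, x in [("gaussian", rng.standard_normal(4096)),
                ("heavy-tailed (Student-t, df=2)", rng.standard_t(df=2, size=4096))]:
    s = np.abs(x).max()                                  # per-tensor max-abs scaling
    mse_fp8 = np.mean((x - nearest(x / s, grid) * s) ** 2)
    mse_int8 = np.mean((x - np.clip(np.round(x / s * 127), -127, 127) / 127 * s) ** 2)
    print(f"{name:32s}  FP8 MSE: {mse_fp8:.2e}   INT8 MSE: {mse_int8:.2e}")
```

On the Gaussian sample the uniform INT8 grid typically attains the lower MSE, while on the heavy-tailed sample the exponentially spaced FP8 grid typically wins, mirroring the analysis above.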
3. Quantization Methodologies and Implementation Strategies
FP8 quantization typically follows a two-step workflow: calibration and conversion. Calibration determines an appropriate scaling factor $s$ (often per tensor or per channel), e.g. $s = \max(|X|) / Q_{\max}$, where $X$ is the tensor and $Q_{\max}$ is the maximum representable FP8 magnitude (Shen et al., 2023, Li et al., 2023). Each value is then quantized as $\hat{x} = \mathrm{round}_{\mathrm{FP8}}(x / s)$, i.e., scaled and rounded onto the FP8 grid. For training and mixed-precision backpropagation, higher-precision “master” weights, optimizer states, or gradients are retained as necessary (Wang et al., 26 Sep 2025).
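The calibrate-then-convert workflow can be sketched as simulated (“fake”) quantization, assuming PyTorch's float8 dtypes (available in recent releases) and simple per-tensor max-abs calibration; production pipelines typically add finer-grained scaling and amax-history (delayed) calibration.

```python
import torch

def fp8_quant_dequant(x: torch.Tensor, fp8_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    """Simulated FP8 quantization: per-tensor max-abs calibration, cast, dequantize.

    Requires a PyTorch build exposing float8 dtypes; this is a sketch of the generic
    scale-and-cast workflow, not any specific library's API.
    """
    q_max = torch.finfo(fp8_dtype).max                     # e.g. 448.0 for E4M3
    scale = x.abs().max().clamp(min=1e-12) / q_max         # calibration: map amax onto FP8 max
    x_fp8 = (x / scale).to(fp8_dtype)                      # conversion onto the FP8 grid
    return x_fp8.to(x.dtype) * scale                       # dequantize for error inspection

x = torch.randn(1024, 1024)
mse = (x - fp8_quant_dequant(x)).pow(2).mean()
print(f"per-tensor E4M3 quantize-dequantize MSE: {mse.item():.3e}")
```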
Adaptive frameworks are now prevalent. For instance, MoFQ (“Mixture of Formats Quantization”) determines the optimal quantization scheme (INT8 or FP8) for each layer to minimize error, employing a per-tensor dynamic analysis of quantization-induced MSE (Zhang et al., 2023). FGMP applies block-wise Fisher information-weighted scoring to assign the lowest viable precision (FP4 or FP8) for each block of weights/activations, tagging only the most sensitive regions for FP8 (Hooper et al., 19 Apr 2025).
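In the spirit of such per-tensor format selection (a simplified sketch, not MoFQ's actual implementation), one can simulate both candidate formats and keep whichever minimizes the quantization-induced MSE:

```python
import torch

def int8_qdq(x):
    scale = x.abs().max().clamp(min=1e-12) / 127
    return (x / scale).round().clamp(-127, 127) * scale

def fp8_qdq(x, dtype=torch.float8_e4m3fn):
    scale = x.abs().max().clamp(min=1e-12) / torch.finfo(dtype).max
    return (x / scale).to(dtype).to(x.dtype) * scale

def choose_format(t):
    """Keep whichever of INT8 / FP8 yields lower quantization-induced MSE for this tensor."""
    mse = {"int8": (t - int8_qdq(t)).pow(2).mean().item(),
           "fp8": (t - fp8_qdq(t)).pow(2).mean().item()}
    return min(mse, key=mse.get), mse

weights = torch.randn(4096)                                # well-behaved distribution
acts = torch.randn(4096)
acts[torch.randint(0, 4096, (4,))] *= 100.0                # a few large outliers
for name, t in [("weights", weights), ("activations", acts)]:
    fmt, mse = choose_format(t)
    print(f"{name}: choose {fmt}  ({mse})")
```

The well-behaved tensor tends to select INT8 and the outlier-heavy tensor FP8, matching the per-layer behavior the adaptive frameworks exploit.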
COAT and related full-FP8 training frameworks address the optimizer-state quantization challenge through a dynamic range expansion transform, reported as a sign-preserving power function $f(x) = \operatorname{sign}(x)\,|x|^{k}$ applied before the FP8 cast, with the exponent $k$ chosen per quantization group. The transform reshapes the distribution of optimizer moments to match FP8’s representable range, preserving optimizer fidelity and reducing quantization error (Xi et al., 25 Oct 2024).
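A minimal sketch of such a range-expansion step is shown below, assuming the sign-preserving power-function form above with a fixed illustrative exponent $k$ (rather than COAT's per-group rule) and generic per-tensor scaling:

```python
import torch

def expand_quantize_restore(v: torch.Tensor, k: float, dtype=torch.float8_e4m3fn):
    """Apply f(x) = sign(x)|x|**k before the FP8 cast, then invert after dequantizing."""
    expanded = v.sign() * v.abs().pow(k)                         # widen the dynamic range
    scale = expanded.abs().max().clamp(min=1e-12) / torch.finfo(dtype).max
    dq = (expanded / scale).to(dtype).to(v.dtype) * scale        # FP8 round trip
    return dq.sign() * dq.abs().pow(1.0 / k)                     # undo the expansion

# Adam-style second moments are tiny and tightly clustered; k > 1 spreads them across
# more exponent bins before quantization, shrinking the relative round-trip error.
v = torch.rand(1_000_000) * 1e-6 + 1e-8
for k in (1.0, 2.0):
    rel = ((v - expand_quantize_restore(v, k)) / v).abs().mean()
    print(f"k = {k}: mean relative error {rel:.2e}")
```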
Modern hardware support, such as NVIDIA Hopper and Intel Gaudi 2/3, allows direct computation in FP8, but most software still uses simulated quantization via custom FP8 emulators or C++/CUDA kernels (Shen et al., 2023, Lee et al., 13 Mar 2025).
4. Empirical Performance in Training and Inference
Extensive experimental results demonstrate that FP8 post-training quantization (PTQ) and quantization-aware training (QAT) often deliver near-lossless accuracy for a wide spectrum of workloads:
- Image classification (ResNet, ViT), NLP (BERT, LLaMA, GPT, translation models), and segmentation tasks all exhibit <1% degradation relative to FP16/BF16/FP32 baselines in most cases (Micikevicius et al., 2022, Shen et al., 2023, Li et al., 2023, Kurtic et al., 4 Nov 2024, Wang et al., 26 Sep 2025).
- Large language models quantized with FP8 (both weights and activations) preserve 99–100% of full-precision benchmark performance even at the 405B-parameter scale (Kurtic et al., 4 Nov 2024, Wang et al., 26 Sep 2025).
- INT8 can offer comparable accuracy when scales and hyperparameters are carefully tuned; FP8 is more robust to outliers and requires less calibration effort (Kurtic et al., 4 Nov 2024).
- In modern LLM inference, thin GEMMs in the decode phase become bandwidth-bound; FP8 yields substantially higher throughput and energy efficiency than FP16/BF16 on accelerators optimized for FP8, directly impacting total cost of ownership (TCO) (Kim et al., 3 Feb 2025, Lee et al., 13 Mar 2025).
FP8 displays clear superiority in quantizing activations, especially in layers where statistical outliers would otherwise cause saturations in INT8 (Zhang et al., 2023, Wu et al., 2023). Simultaneous quantization of optimizer states and activations further halves training memory footprint and can yield a ~1.5× speedup while maintaining “essentially lossless” accuracy (Xi et al., 25 Oct 2024).
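To make the memory and bandwidth arguments concrete, a back-of-envelope calculation with illustrative round numbers (not measurements from the cited papers):

```python
# Illustrative round numbers for a 7B-parameter model; activation memory is excluded.
params = 7e9

bf16_state = params * (2 + 2 + 4 + 4)    # BF16 weights + grads, FP32 Adam first/second moments
fp8_opt    = params * (2 + 2 + 1 + 1)    # same, but both Adam moments held in FP8 (scales ignored)
print(f"training state: {bf16_state / 2**30:.0f} GiB vs {fp8_opt / 2**30:.0f} GiB with FP8 optimizer states")

# Decode-phase GEMMs are bandwidth-bound: per-token latency is roughly weight_bytes / HBM bandwidth,
# so halving the bytes per weight (BF16 -> FP8) roughly halves the weight-streaming time.
hbm = 3.35e12                             # bytes/s, an H100-class HBM figure (approximate)
for fmt, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    print(f"{fmt}: ~{params * bytes_per_param / hbm * 1e3:.1f} ms to stream weights per decoded token")
```

With these round numbers the optimizer-state footprint halves and per-token weight streaming drops by roughly 2x, consistent with the bandwidth-bound regime described above.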
5. Trade-offs, Limitations, and Failure Modes
While FP8 provides major efficiency and accuracy gains, specific limitations and trade-offs are observed:
- Training stability: FP8-only training is sensitive to dynamic range and loss surface sharpness. Early experiments reveal instability—particularly when exponent bits are insufficient—leading to frequent loss spikes or divergence (Lee et al., 29 May 2024). Instability is exacerbated by certain activations (notably SwiGLU), which can cause outlier amplification during long training runs (Fishman et al., 19 Sep 2024).
- Mitigation strategies include per-channel scaling, dynamic granularity (hybrid block- and token-wise scaling), and algorithmic interventions such as Smooth-SwiGLU, which rescales activations per channel to avoid FP8 overflow (Fishman et al., 19 Sep 2024, Wang et al., 26 Sep 2025); a per-channel rescaling sketch appears after this list.
- Mixed-precision bookkeeping is crucial: gradients, optimizer moments, and accumulators are generally retained in higher-precision (FP16/BF16/FP32) to prevent catastrophic update loss (Wang et al., 26 Sep 2025).
- For smaller bit-widths (<4 bits), minifloats may underperform integer quantization; at 8 bits, FP8 is generally superior for outlier-prone and activation-rich tensors (Aggarwal et al., 2023).
- Some non-linear or normalization operations (LayerNorm, Softmax) retain better accuracy if kept in higher precision due to pronounced quantization sensitivity (Shen et al., 2023, Li et al., 2023).
- Security: FP8 quantization provides greater resilience against certain bit-level fault injection attacks than FP16 or INT4, lowering attack success rates, but does not fully eliminate the risk of transferability from compromised high-precision models (Zahran et al., 4 Jul 2025).
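The per-channel rescaling idea referenced in the mitigation list above can be sketched as follows. The sketch is inspired by Smooth-SwiGLU but is not the authors' formulation: delayed scaling is emulated with a stale calibration batch, and the smoothing factor is a simple heuristic that shrinks only the channels that no longer fit the FP8 range.

```python
import torch

FP8 = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8).max                        # 448.0

def cast_fp8(x, scale):
    # Clamp to the representable range so out-of-range values saturate rather than overflow.
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(FP8).to(x.dtype) * scale

# Delayed scaling: the per-tensor scale was calibrated on earlier, well-behaved activations...
calib = torch.randn(512, 1024)
stale_scale = calib.abs().max() / FP8_MAX

# ...but later outlier amplification (as reported for SwiGLU in long runs) blows up one channel.
acts = torch.randn(512, 1024)
acts[:, 7] *= 300.0

clipped = cast_fp8(acts, stale_scale)                 # shared stale scale: outlier channel saturates

# Per-channel smoothing (sketch): shrink only the channels that exceed the FP8 range before
# the cast, and fold the factor back afterwards.
smooth = (acts.abs().amax(dim=0, keepdim=True) / (FP8_MAX * stale_scale)).clamp(min=1.0)
smoothed = cast_fp8(acts / smooth, stale_scale) * smooth

print("MSE with shared stale scale:   ", (acts - clipped).pow(2).mean().item())
print("MSE with per-channel smoothing:", (acts - smoothed).pow(2).mean().item())
```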
6. Application Domains and System-Level Implications
The adoption of FP8 quantization spans a wide range of deep learning domains:
- LLM inference and training: FP8 enables throughput improvements up to 34% versus BF16 in trillion-token-scale training, with memory savings approaching 30–50% (Fishman et al., 19 Sep 2024, Xi et al., 25 Oct 2024, Lee et al., 13 Mar 2025, Wang et al., 26 Sep 2025).
- Large computer vision models: FP8 (E4M3, E3M4) achieves near-lossless accuracy in classification and segmentation, with dynamic format selection preferred depending on task and outlier prevalence (Shen et al., 2023, Zhang et al., 2023).
- Video diffusion models: Advances in tile-wise FP8 quantization, coupled with structured sparsity, enable nearly 5× acceleration with preserved sample quality in high-resolution (720p) video generation (Liu et al., 5 Jun 2025).
- Datacenter deployment: As LLM inference is increasingly bandwidth-bound, FP8 is pivotal in maximizing compute utilization and reducing operational TCO, especially for long-context or real-time deployments (Kim et al., 3 Feb 2025).
- Mixed-precision, hybrid, and adaptive quantization strategies (MoFQ, FGMP, COAT) use FP8 as the fidelity-preserving “escape hatch” for sensitive blocks, further reducing average bitwidth and hardware cost without material accuracy loss (Zhang et al., 2023, Hooper et al., 19 Apr 2025, Xi et al., 25 Oct 2024); a sensitivity-scoring sketch follows this list.
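A highly simplified sketch of sensitivity-driven precision assignment in the spirit of FGMP is shown below. It uses a diagonal-Fisher proxy (squared gradients), per-tensor scaling, and a symmetric INT4 grid as a stand-in for FP4; none of these choices are claimed to match the paper's implementation.

```python
import torch

def fp8_qdq(x, dtype=torch.float8_e4m3fn):
    scale = x.abs().max().clamp(min=1e-12) / torch.finfo(dtype).max
    return (x / scale).to(dtype).to(x.dtype) * scale

def low_qdq(x):
    # Symmetric 4-bit grid as a stand-in for the cheaper format (FGMP itself uses FP4).
    scale = x.abs().max().clamp(min=1e-12) / 7
    return (x / scale).round().clamp(-7, 7) * scale

def assign_precision(weights, grads, block=128, keep_ratio=0.25):
    """Score blocks by a Fisher-weighted quantization error (diagonal-Fisher proxy = grad**2)
    and keep the most sensitive blocks in FP8, the rest in the cheaper format."""
    w, g = weights.reshape(-1, block), grads.reshape(-1, block)
    scores = (g.pow(2) * (w - low_qdq(w)).pow(2)).sum(dim=1)   # sensitivity to low precision
    keep = torch.zeros_like(scores, dtype=torch.bool)
    keep[scores.topk(int(keep_ratio * scores.numel())).indices] = True
    w_q = torch.where(keep[:, None], fp8_qdq(w), low_qdq(w))
    return w_q.reshape(weights.shape), keep

w = torch.randn(4096 * 128)
g = 0.01 * torch.randn_like(w)
w_q, keep = assign_precision(w, g)
print(f"{keep.float().mean().item():.0%} of blocks in FP8; "
      f"Fisher-weighted error: {(g.pow(2) * (w - w_q).pow(2)).sum().item():.3e}")
```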
FP8’s non-uniform, parameterized representation is highly compatible with automated, per-layer, or per-block selection frameworks, and is shaping the design of next-generation AI accelerators supporting dynamic, programmable precision at runtime (Zhang et al., 2023, Hooper et al., 19 Apr 2025). Publicly released toolkits such as Intel Neural Compressor and open-sourced FP8 reference implementations facilitate broad adoption in both research and production contexts (Shen et al., 2023, Fishman et al., 19 Sep 2024, Xi et al., 25 Oct 2024).
7. Outlook and Research Directions
Recent progress confirms FP8 as a robust and efficient quantization paradigm for DNN training and inference across domains, provided strategies for outlier handling, scaling, and block-wise adaptation are applied. Key directions for future work include:
- Adaptive and hybrid precision: Dynamic selection of quantization format (FP8/INT8/FP4) per-tensor, per-layer, or per-block, possibly with runtime reconfiguration, remains an active area (Zhang et al., 2023, Hooper et al., 19 Apr 2025).
- Full-stack co-design: Work such as FPSAttention demonstrates the coupling of FP8 quantization granularity with dataflow and sparsity patterns, enabling hardware-software codesign for maximal efficiency (Liu et al., 5 Jun 2025).
- Training stability and optimizer quantization: Addressing persistent instability or catastrophic loss spikes (notably with SwiGLU or for long runs) requires further advances in scaling, stochastic rounding, and precision-adaptive computation (Lee et al., 29 May 2024, Fishman et al., 19 Sep 2024).
- Security aspects: Although FP8 provides enhanced resistance to certain attacks, robustness against more sophisticated or transfer-based attacks requires further study (Zahran et al., 4 Jul 2025).
- Open benchmarking and toolchains: Standardized toolkits (e.g., Intel Neural Compressor) and open-reference implementations (e.g., Megatron-DeepSpeed, COAT, FireQ) will facilitate transparent comparison and broader community adoption.
Overall, FP8 quantization delivers a compelling balance between accuracy, efficiency, and flexibility, and is rapidly becoming a central component in large-scale, high-performance neural network systems.