FP8 Formats in Deep Learning

Updated 4 March 2026

FP8 formats are 8-bit floating-point representations designed to optimize memory usage, arithmetic throughput, and energy efficiency in deep learning workloads.
They incorporate specialized quantization techniques like dynamic per-tensor scaling, static scaling, and squeeze-shift adaptation to maintain numerical stability and minimal accuracy loss.
Hardware accelerators such as NVIDIA H100 and Intel Gaudi 2 leverage native FP8 support to achieve higher throughput and energy efficiency compared to traditional FP16/FP32 operations.

A floating-point 8-bit (FP8) format is a binary representation for real numbers designed to optimize memory, arithmetic throughput, and energy efficiency in deep learning hardware. The move to FP8 from 16/32-bit formats has been driven by the need to accelerate both training and inference in deep neural networks—particularly LLMs and computer vision architectures—while maintaining numerical stability and minimal accuracy loss. A variety of FP8 variants, scaling strategies, and hardware support techniques have been developed to ensure their compatibility with diverse DNN workloads, yielding significant efficiency gains across production and research models.

1. FP8 Format Variants: Bit Allocation, Mathematical Definition, and Rationale

FP8 refers to any 8-bit floating-point format that encodes a real value as a triple: sign bit, exponent field of E bits, and mantissa (fraction) field of M bits, with E + M = 7. The two most widely adopted canonical formats are E4M3 and E5M2:

Format	Exponent Bits (E)	Mantissa Bits (M)	Bias	Maximum Normal Value	Min Normal Value	Machine Epsilon
E4M3	4	3	7	448	2^-6 ≈ 1.56e-2	2^-3 = 0.125
E5M2	5	2	15	57344 ≈ 5.73×10⁴	2^-14 ≈ 6.10e-5	2^-2 = 0.25
E3M4	3	4	3	≈15.5	2^-2	2^-4 = 0.0625
E2M5	2	5	1	≈3.94	2⁰ = 1	2^-5 = 0.03125

A normalized value $x$ is given by

$x = (-1)^s \cdot 2^{e - \text{bias}} \cdot \left(1 + \frac{m}{2^M}\right)$

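where $s$ is the sign bit, $e$ the stored exponent field, and $m$ the integer value of the mantissa field. Subnormals (exponent field zero) drop the implicit leading one, values saturate at the format limits, and the bias is computed as $2^{E-1}-1$. E5M2 provides substantially greater dynamic range for a given number of bits, at the expense of precision (one fewer mantissa bit); E4M3 has finer granularity within its narrower range and suits data with less extreme outliers (Micikevicius et al., 2022, Zhang et al., 2023, Shen et al., 2023).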
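As a worked check of the table entries, the largest normal values follow from substituting the largest usable exponent and mantissa fields into the formula above. This assumes the encoding conventions described by Micikevicius et al. (2022): E5M2 reserves its top exponent field (31) for infinities and NaN, while E4M3 reserves only the single pattern with exponent field 15 and mantissa field 7.

$x_{\max}^{\text{E5M2}} = 2^{30-15} \cdot \left(1 + \frac{3}{4}\right) = 57344$

$x_{\max}^{\text{E4M3}} = 2^{15-7} \cdot \left(1 + \frac{6}{8}\right) = 448$

The E3M4 and E2M5 entries are consistent with reserving the top exponent field in the same way as E5M2:

$x_{\max}^{\text{E3M4}} = 2^{6-3} \cdot \left(1 + \frac{15}{16}\right) = 15.5, \qquad x_{\max}^{\text{E2M5}} = 2^{2-1} \cdot \left(1 + \frac{31}{32}\right) = 3.9375$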

Subnormals are supported in most implementations of E4M3 and E5M2. The less common E3M4 and E2M5 variants are used in certain mixed-precision or task-adaptive schemes. The choice of exponent/mantissa partition is a direct trade-off between dynamic range (to avoid overflow/underflow) and granularity (to minimize rounding errors).
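The normalized and subnormal cases can be made concrete with a small decoder. The following is a minimal sketch, not a reference implementation: the function name `decode_fp8` is hypothetical, and the special NaN/Inf bit patterns are deliberately ignored, so every byte is decoded as a finite number.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode an 8-bit pattern laid out as sign | exponent | mantissa.

    Assumption: NaN/Inf encodings are not modeled; all patterns decode as finite.
    """
    assert exp_bits + man_bits == 7 and 0 <= byte <= 0xFF
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    fraction = mantissa / 2 ** man_bits
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * fraction
    # Normal: implicit leading 1 in front of the fraction.
    return sign * 2.0 ** (exponent - bias) * (1.0 + fraction)


print(decode_fp8(0x7E, 4, 3))  # largest finite E4M3 value: 448.0
print(decode_fp8(0x7B, 5, 2))  # largest finite E5M2 value: 57344.0
print(decode_fp8(0x08, 4, 3))  # smallest normal E4M3 value: 0.015625 (2^-6)
print(decode_fp8(0x01, 4, 3))  # smallest subnormal E4M3 value: 0.001953125 (2^-9)
```

The first three printed values match the E4M3 and E5M2 rows of the table; the last shows how subnormals extend the range below the minimum normal value.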

2. Quantization, Scaling, and Rounding Methods

Direct quantization to FP8 exhibits severe gradient underflow/overflow and rounding artifacts without careful scale management. Three main strategies dominate:

Dynamic Per-Tensor Scaling: For each tensor, a scale $s=2^{b_\text{scale}}$ is chosen so that the largest absolute value in the tensor fits inside FP8's representable range. This update can be per-microbatch or per-token and is computed as $b_\text{scale} = \left\lfloor \log_2(\text{FP8\_max} / \max |X|)\right\rfloor$ (Perez et al., 2023).
Static Scaling Based on Model Structure: The $\mu$nit Scaling and unit scaling paradigms employ scale factors determined by first-principles analysis (e.g., $s = \sqrt{\text{fan-in}}$ for hidden activations/weights, $s = \text{fan-in}$ for gradients), obviating dynamic computation (Narayan et al., 9 Feb 2025, Blake et al., 2023).
Squeeze-Shift Adaptation: Shifted and Squeezed FP8 (S2FP8) uses learned per-tensor shift ($\alpha$) and squeeze ($\gamma$) parameters based on empirical statistics (mean, max in $\log_2$ space) to transform the tensor into the optimal FP8 window before quantization (Cambier et al., 2020).
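The dynamic per-tensor rule above can be sketched in a few lines of plain Python. This is an illustrative simulation under stated assumptions, not any library's API: the helper names (`scale_exponent`, `round_to_e4m3`, `fake_quantize`) are hypothetical, the target format is assumed to be E4M3 with a maximum of 448, and values are "fake quantized" (rounded to the E4M3 grid and scaled back) rather than stored as bytes.

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value
E4M3_MIN_EXP = -6    # exponent of the smallest normal value
E4M3_MAN_BITS = 3


def scale_exponent(values, fp8_max=E4M3_MAX):
    """b_scale = floor(log2(FP8_max / max|X|)), as in the per-tensor rule."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0
    return math.floor(math.log2(fp8_max / amax))


def round_to_e4m3(x):
    """Round to the nearest E4M3 value (ties-to-even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))            # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)       # clamp so small values use the subnormal step
    step = 2.0 ** (exp - E4M3_MAN_BITS)  # spacing of representable values near x
    q = round(x / step) * step           # Python's round() is half-to-even
    return max(-E4M3_MAX, min(E4M3_MAX, q))


def fake_quantize(values):
    """Scale by 2**b_scale, round to E4M3, and scale back."""
    b = scale_exponent(values)
    s = 2.0 ** b
    return [round_to_e4m3(v * s) / s for v in values], b


dequantized, b_scale = fake_quantize([0.5, -3.0, 1000.0])
print(b_scale)       # -2, so the scale is 0.25 and 1000 maps to 250 <= 448
print(dequantized)   # [0.5, -3.0, 1024.0]
```

Because the scale is a power of two, applying and removing it is exact; all of the error (here, 1000 becoming 1024) comes from rounding to the 3-bit mantissa.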

Rounding to nearest, ties-to-even is standard, but stochastic rounding is critical for very small gradient updates: changes below the least FP8-representable difference are randomly retained or discarded so that the expected value is preserved, especially in SGD updates (Wang et al., 2018, Micikevicius et al., 2022).
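The expectation-preserving property of stochastic rounding can be illustrated with a short simulation. This is a simplified sketch: it rounds onto a uniform grid with spacing `step`, assumed here to be $2^{-3}$ (the E4M3 spacing between 1 and 2), and the function name `stochastic_round` is hypothetical.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal to
    the fractional distance from the lower grid point."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step
    return lower + step if rng.random() < p_up else lower


rng = random.Random(0)   # fixed seed so the run is repeatable
step = 2.0 ** -3         # assumed local spacing of representable values
x = 1.0 + 0.01           # a weight of 1.0 after a small update of 0.01

nearest = round(x / step) * step   # round-to-nearest discards the update
samples = [stochastic_round(x, step, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print(nearest)  # 1.0
print(mean)     # close to 1.01 on average
```

Round-to-nearest returns 1.0 every time, so repeated small updates would never move the weight, whereas the stochastic results average out near the true value.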

3. Training and Inference: Workflow and Empirical Results

Training Regimes

Accumulator Precision: All regimes accumulate partial products in higher precision (FP16 or FP32) to avoid catastrophic precision loss ("swamping") from long dot-products. "Chunk-based accumulation" breaks long reductions into smaller groups for bounded error (Wang et al., 2018).
Mixed-Precision Strategy: Weights and activations typically use E4M3; gradients use E5M2 for greater range. Master copies of weights may be kept in FP16 or FP32, with casting to FP8 for matrix arithmetic (Micikevicius et al., 2022, Perez et al., 2023).
Loss and Gradient Scaling: Loss scaling or static/learned per-tensor scales are required to avoid gradient underflow (Wang et al., 2018, Perez et al., 2023, Cambier et al., 2020).
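The swamping effect and the benefit of chunk-based accumulation can be reproduced with a toy model of a reduced-precision accumulator. This is a sketch under simplifying assumptions: the accumulator is modeled only by rounding every partial sum to an 11-bit significand (the significand width of FP16), exponent range is not modeled, and the helper names are hypothetical.

```python
import math


def round_to_significand(x, bits=11):
    """Round x to `bits` significant bits (ties-to-even); exponent range not modeled."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2 ** bits), e - bits)


def naive_sum(values, bits=11):
    """Accumulate sequentially, rounding after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_significand(acc + v, bits)
    return acc


def chunked_sum(values, chunk=64, bits=11):
    """Sum each chunk separately, then sum the chunk totals."""
    partials = [naive_sum(values[i:i + chunk], bits)
                for i in range(0, len(values), chunk)]
    return naive_sum(partials, bits)


ones = [1.0] * 4096
print(naive_sum(ones))    # 2048.0: once the sum reaches 2048, adding 1 is rounded away
print(chunked_sum(ones))  # 4096.0: partial sums stay small enough to be exact
```

In the naive loop the addend becomes smaller than half the accumulator's spacing and is lost; chunking keeps the operands of each addition closer in magnitude, which bounds the error.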

Empirically, FP8 training with these workflows achieves accuracy parity with FP16 or FP32 on diverse DNNs: ResNet-18/50 on ImageNet ([27.86% → 28.28%], <1% drop), Transformer/BERT models (<0.5% drop on SQuAD/GLUE), and GPT/Llama models (±0.2% validation accuracy on models up to 70B parameters) (Wang et al., 2018, Perez et al., 2023, Noune et al., 2022, Narayan et al., 9 Feb 2025). S2FP8 and similar per-tensor statistic adaptation methods enable "out-of-the-box" stability across tasks (Cambier et al., 2020, Blake et al., 2023).

Inference and Post-Training Quantization

FP8 PTQ achieves higher workload coverage than INT8, especially for NLP and generative models. On a benchmark suite of 75 networks, E4M3 static quantization reaches 92.6% pass rate (≤1% drop), compared to 65.9% for INT8 (Shen et al., 2023). In BERT-base (GLUE), INT8 PTQ collapses, while E4M3-based FP8 PTQ matches FP32 performance (Li et al., 2023). For CV, E3M4 is marginally more robust for weights and activations with low maximum magnitude. For generative models, FP8 preserves visual fidelity (FID ≈ FP32) unlike INT8 (Shen et al., 2023).

4. Hardware Implementation, Efficiency, and Accelerator Support

Design Considerations

FP8 Compute Units: Both NVIDIA Hopper/H100 and Intel Gaudi 2 natively support E4M3/E5M2 arithmetic with FP32 (or FP16) accumulation (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025). Graphcore and RISC-V extensions provide similar support (Bertaccini et al., 2022).
Scale Application: Gaudi 2 exposes exponent-bias tweaks for efficient power-of-two scaling; NVIDIA H100 applies scaling via explicit multiplies unless vectors use exact powers (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025).
Accumulator Design: MiniFloat-NN integrates a fused expanding sum-of-dot-products (ExSdotp) unit to accumulate FP8→FP16 in a single normalization operation, reducing area/critical path by ~30% (Bertaccini et al., 2022).
Compute-in-memory (CIM) Arrays: Flexible FP8 CIM accelerators dynamically adjust bitwidth with DSBP and use FIFO-based input alignment, achieving up to 77.9 TFLOPS/W for low-range modes and matching FP8-baseline accuracy with large multipliers (Zhao et al., 5 Feb 2026).
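The idea behind exponent-bias scaling can be shown conceptually: decoding the same bit patterns with a shifted bias is equivalent to multiplying every value by a power of two, so no separate multiply is needed. The sketch below is only an illustration of that arithmetic identity; it does not represent the Gaudi 2 programming interface, and the `decode_fp8` helper and the shift `k = 3` are hypothetical.

```python
def decode_fp8(byte, exp_bits=4, man_bits=3, bias=7):
    """Decode sign | exponent | mantissa with a configurable bias (NaN/Inf not modeled)."""
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    fraction = (byte & ((1 << man_bits) - 1)) / 2 ** man_bits
    if exponent == 0:
        return sign * 2.0 ** (1 - bias) * fraction
    return sign * 2.0 ** (exponent - bias) * (1.0 + fraction)


k = 3  # hypothetical bias adjustment
same = all(
    decode_fp8(b, bias=7 + k) == decode_fp8(b, bias=7) * 2.0 ** -k
    for b in range(256)
)
print(same)  # True: raising the bias by k scales every value by 2**-k
```

A scale that is not a power of two has no such shortcut and has to be applied as an explicit multiplication.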

Efficiency and Utilization

FP8 multiply-accumulate units typically deliver:

1.5–2× the throughput and 2–4× the energy efficiency of FP16 on modern ASICs (Wang et al., 2018, Kim et al., 3 Feb 2025, Lee et al., 13 Mar 2025).
Up to 4× memory bandwidth reduction and 4× training speed/energy improvement compared to FP32 (Noune et al., 2022).
Area and gate count remain similar to INT8 with fixed-point accumulators, with total overhead <5% for mixed INT8/FP8 support (Zhang et al., 2023).

Operator-level GEMMs in FP8 can achieve >90% model FLOPs utilization (MFU), with FP8 throughput exceeding 800 TFLOPS on Gaudi 2 and a 1.5–2× speedup over BF16 in LLM inference (Lee et al., 13 Mar 2025, Kim et al., 3 Feb 2025). For datacenter total cost of ownership (TCO), FP8 reduces cost-per-inference by improving TFLOPS/W and lowering peak power draw during decode-dominated workload phases (Kim et al., 3 Feb 2025).

5. Comparative Analysis: FP8 versus INT8 and Specialized Schemes

FP8 offers a representational advantage for outlier-dominated or heavy-tailed distributions because its logarithmic quantization adapts to both small and large values with minimal quantization error in deep networks, compared to the uniform quantization of INT8, which is optimal only for bounded/uniform distributions (Kuzmin et al., 2022, Baalen et al., 2023, Shen et al., 2023). For discriminative inference, PTQ with FP8 E4M3 outperforms INT8 on model families like BERT and GPT (higher workload coverage, lower accuracy drop), while E3M4 and E2M5 accommodate high-precision or near-zero-data workloads in CV (Shen et al., 2023, Zhang et al., 2023).
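The effect of a single outlier on uniform versus floating-point quantization can be seen in a small numerical experiment. This is an illustrative sketch, not a benchmark: the input values are made up, INT8 is modeled as symmetric per-tensor quantization with scale max|X|/127, FP8 is modeled as E4M3 with a power-of-two per-tensor scale, and all function names are hypothetical.

```python
import math


def int8_fake_quant(values):
    """Symmetric per-tensor INT8: uniform grid with step max|X| / 127."""
    scale = max(abs(v) for v in values) / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def round_to_e4m3(x):
    """Round to the nearest E4M3 value (ties-to-even), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(abs(x))
    step = 2.0 ** (max(e - 1, -6) - 3)   # -6: min normal exponent, 3: mantissa bits
    return max(-448.0, min(448.0, round(x / step) * step))


def e4m3_fake_quant(values):
    """Per-tensor power-of-two scale followed by E4M3 rounding."""
    amax = max(abs(v) for v in values)
    s = 2.0 ** math.floor(math.log2(448.0 / amax))
    return [round_to_e4m3(v * s) / s for v in values]


def max_relative_error(values, quantized):
    return max(abs(q - v) / abs(v) for v, q in zip(values, quantized))


values = [0.05, 0.1, 0.2, 100.0]   # three small entries and one outlier
int8_err = max_relative_error(values, int8_fake_quant(values))
fp8_err = max_relative_error(values, e4m3_fake_quant(values))
print(int8_err)  # 1.0: the small entries all collapse to zero
print(fp8_err)   # a few percent: relative error is bounded by the mantissa width
```

The outlier forces the INT8 step size to about 0.79, which wipes out the small entries, while the E4M3 grid keeps roughly constant relative precision across magnitudes.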

Advanced schemes like microscaling (block-wise scaling per k=32 elements) and all-you-can-fit "FFP8" enable fine granularity in per-block dynamic range management, further reducing quantization error at extreme bitwidths (6/4 bits) without major algorithm modifications (Rouhani et al., 2023, Huang et al., 2021). Flexible mixed-precision strategies, searching per-layer over $\{\text{INT8},\text{E5M2},\text{E4M3},\text{E3M4},\text{E2M5}\}$, achieve maximal accuracy recovery for arbitrary model architectures (Zhang et al., 2023).
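Block-wise scaling can be sketched by choosing one power-of-two scale per block of 32 elements instead of one per tensor. The example below is a simplified illustration that reuses the per-tensor rule $\lfloor \log_2(\text{FP8\_max} / \max|X|) \rfloor$ inside each block; it is not the exact procedure of any microscaling specification, and the function name and input data are hypothetical.

```python
import math


def block_scale_exponents(values, block=32, fp8_max=448.0):
    """Return one power-of-two scale exponent per block of `block` elements."""
    exponents = []
    for i in range(0, len(values), block):
        amax = max(abs(v) for v in values[i:i + block])
        exponents.append(0 if amax == 0.0 else math.floor(math.log2(fp8_max / amax)))
    return exponents


# 64 values: the second block of 32 contains a single large outlier.
values = [0.01] * 32 + [0.01] * 31 + [200.0]

print(block_scale_exponents(values))                     # [15, 1]
print(block_scale_exponents(values, block=len(values)))  # [1]
```

With one scale for the whole tensor, the outlier dictates an exponent of 1 for every element; with per-block scales, the outlier-free block can use an exponent of 15 and therefore far more of the FP8 range.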

6. Practical Guidelines, Limitations, and Current Recommendations

Format and Scaling Selection

E4M3 for activations/weights, especially in NLP/LLM; E5M2 for gradients or extreme outliers.
E3M4 or E2M5 in CV tasks or where precision near zero dominates accuracy.
Per-tensor or block-wise scaling, optionally dynamic; static scaling via $\mu$nit Scaling or unit scaling if model architecture/initialization is well-controlled (Perez et al., 2023, Narayan et al., 9 Feb 2025, Blake et al., 2023, Rouhani et al., 2023).

Implementation

Use per-channel scaling for weights, per-tensor for activations. Exclude first/last layers from quantization or employ higher-precision as needed for stability (Shen et al., 2023).
Stochastic rounding is required in the SGD loop for reliable convergence (Wang et al., 2018, Micikevicius et al., 2022).
Hardware priorities: prefer architectures supporting native FP8 arithmetic, exponent-bias scaling, and fused multiply-accumulate with FP16/FP32 accumulators (Wang et al., 2018, Zhao et al., 5 Feb 2026, Bertaccini et al., 2022).

Limitations and Open Issues

Small or deep models with highly skewed or non-stationary tensor distributions may require layer-specific tuning or hybrid-precision schemes (Micikevicius et al., 2022, Cambier et al., 2020).
INT8 remains more efficient than FP8 in area and energy for on-device edge inference, but its more limited dynamic range poses issues for Transformer-type models (Baalen et al., 2023, Shen et al., 2023).
E4M3/E5M2 outlier resilience does not fully remove the risk of catastrophic overflow/underflow if scaling is misapplied or activation statistics drift substantially (Kuzmin et al., 2022, Zhang et al., 2023).

7. Future Directions and Emerging Standards

Recent advances focus on:

Block-wise scaling (microscaling) and groupwise bitwidth prediction to optimize storage and energy efficiency (Rouhani et al., 2023, Zhao et al., 5 Feb 2026).
Flexible, mixed-precision quantization frameworks across FP8/INT8 to minimize per-layer quantization error (Zhang et al., 2023).
End-to-end training at sub-8-bit with minimal recipe modification (MXFP6, S2FP8) and improved hardware support for per-block statistics (Huang et al., 2021, Rouhani et al., 2023, Cambier et al., 2020).
Datacenter-scale TCO optimization driven by decode-phase optimization, with FP8 accelerating LLM inference at high efficiency (Kim et al., 3 Feb 2025).

FP8 formats, particularly E4M3 and E5M2, are now integral parts of major commercial AI accelerator toolchains for both training and inference, with ongoing developments extending support for increasingly aggressive bitwidth reductions and adaptive mixed-precision architectures.