FP8 Arithmetic: Formats, Methods & Acceleration

Updated 27 May 2026

FP8 arithmetic is an 8-bit floating-point system that encodes real numbers using a sign, exponent, and mantissa with a non-uniform dynamic range, exemplified by E4M3 and E5M2 formats.
It enables efficient machine learning and DGEMM emulation by employing fused multiply-add operations, precise rounding, and affine quantization techniques.
Hardware implementations on NVIDIA, Intel, FPGA, and neuromorphic platforms leverage per-tensor scaling and optimized logic to balance energy savings with dynamic range and precision.

A floating-point 8-bit (FP8) arithmetic system refers to a family of number representations and computation algorithms using 8-bit words to encode signed real values with a non-uniform (logarithmic) dynamic range. FP8 arithmetic has emerged as a critical enabler for high-throughput machine learning training and inference, and for accelerating double-precision matrix multiplication (DGEMM) via low-precision emulation schemes on modern hardware. The canonical FP8 formats, E4M3 and E5M2, are supported on NVIDIA Hopper/Blackwell, Intel Gaudi, and Open Compute Project (OCP) accelerator platforms. Use cases range from neural post-training quantization to numerically precise scientific computation via emulation, and span both digital and nontraditional substrates (such as iontronic and neuromorphic). The following provides a rigorous exposition of FP8 arithmetic’s formats, algorithms, hardware, numerical properties, and empirical impact.

1. FP8 Number Format Definition and Encodings

The standard FP8 word encodes a real number as a triplet: sign (s), exponent (e), and mantissa/fraction (m), parameterized by exponent width E and mantissa width M, with 1+E+M=8. The most widespread encodings are:

Format	Sign	Exponent (bits)	Mantissa (bits)	Bias	Min Normal	Max Normal	Epsilon
E4M3	1	4	3	7	$2^{-6}$	448	$2^{-3}$
E5M2	1	5	2	15	$2^{-14}$	$\sim$ 57,344	$2^{-2}$

Normal values ( $1 \leq e \leq 2^E-2$; E4M3 additionally uses the top exponent code for normals, as described below):

$x = (-1)^s \times 2^{e-\text{bias}} \times (1 + m/2^M)$

Subnormals ( $e=0$ , $m\neq0$ ):

$x = (-1)^s \times 2^{1-\text{bias}} \times (m/2^M)$

Special encodings follow IEEE-754 for E5M2 (full Inf/NaN support), while E4M3 omits Infs, repurposing the terminal exponent for an extra normal binade, enabling greater dynamic range at the expense of explicit infinity representation (Micikevicius et al., 2022). Subnormals permit gradual underflow and are critical for small magnitude representation (Zhang et al., 2023).
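The normal, subnormal, and special-value rules above can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: the function names and the `ieee_specials` flag are hypothetical, and the E4M3 branch assumes the convention in which only the all-ones exponent with all-ones mantissa encodes NaN.

```python
import math

def decode_fp8(byte, exp_bits, man_bits, ieee_specials):
    """Decode one FP8 byte to a Python float.

    ieee_specials=True  -> top exponent holds Inf/NaN (E5M2 style).
    ieee_specials=False -> top exponent holds normals, except the
                           all-ones mantissa, which is NaN (E4M3 style, assumed).
    """
    bias = (1 << (exp_bits - 1)) - 1
    exp_mask = (1 << exp_bits) - 1
    man_mask = (1 << man_bits) - 1
    s = (byte >> 7) & 1
    e = (byte >> man_bits) & exp_mask
    m = byte & man_mask
    sign = -1.0 if s else 1.0

    if e == exp_mask:
        if ieee_specials:
            return sign * math.inf if m == 0 else math.nan
        if m == man_mask:
            return math.nan
    if e == 0:
        # Zero and subnormals: no hidden bit, exponent fixed at 1 - bias.
        return sign * 2.0 ** (1 - bias) * (m / (1 << man_bits))
    # Normals: hidden leading 1.
    return sign * 2.0 ** (e - bias) * (1 + m / (1 << man_bits))

def decode_e4m3(byte):
    return decode_fp8(byte, 4, 3, ieee_specials=False)

def decode_e5m2(byte):
    return decode_fp8(byte, 5, 2, ieee_specials=True)
```

For example, `decode_e4m3(0x7E)` decodes the largest finite E4M3 code and `decode_e5m2(0x7C)` decodes positive infinity.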

The FP8 dynamic range, precision, and quantization grid depend on the exponent/mantissa split; high E/low M favors range over granularity and vice versa. E5M2 spans normal magnitudes from $2^{-14}$ to just under $2^{16}$, ideal for dynamic gradients and Adam moments (Fishman et al., 2024), while E4M3’s narrower range, from $2^{-6}$ to just under $2^{9}$, is suited for weights/activations (Kuzmin et al., 2022).

2. Mathematical Operations and Rounding in FP8

Arithmetic on FP8 is realized by specialized or emulated fused multiply-add (FMA), addition, subtraction, and comparison operations. Implementations fall into three principal categories:

True digital floating-point arithmetic using IEEE-style normalization, exponent alignment, (1+M)-bit mantissa adders/multipliers, sticky and guard bits, and rounding to nearest-even. The rounding operation heavily limits effective accuracy, with final quantization error proportional to the unit roundoff $u = 2^{-(M+1)}$.
Integer-based emulation for small mantissas, expressing FP8 arithmetic via integer adder/multiplier logic with conditional carry corrections for correctly rounded results under diverse rounding modes (round-to-nearest, toward zero, etc.) (Lindberg et al., 2024).
Spatial combinational and spiking approaches leveraging native logic gates in unconventional substrates, with bit-exact pipelines upholding IEEE-754 rounding and normalization by propagation of sticky signals and leading-zero detectors (Tang, 8 Dec 2025).
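The round-to-nearest-even step shared by these implementations can be sketched in software. The helper below is a hypothetical illustration that rounds a finite Python float onto an FP8 grid; its defaults assume E4M3 with a maximum finite value of 448 and saturating overflow, and it relies on Python's built-in `round()`, which rounds halves to even.

```python
import math

def round_to_fp8(x, man_bits=3, min_exp=-6, max_finite=448.0):
    """Round a finite float to the nearest FP8 grid point (defaults: E4M3).

    Ties go to the even mantissa; out-of-range magnitudes saturate.
    """
    if x == 0.0:
        return x
    a = abs(x)
    _, e = math.frexp(a)                 # a = f * 2**e with 0.5 <= f < 1
    exp = max(e - 1, min_exp)            # binade of a, clamped to the subnormal range
    quantum = 2.0 ** (exp - man_bits)    # grid spacing in this binade
    q = round(a / quantum) * quantum     # round() is half-to-even on the integer grid
    return math.copysign(min(q, max_finite), x)

# 1.0625 lies halfway between 1.0 and 1.125; the even mantissa (1.0) wins.
tie_low = round_to_fp8(1.0625)
# 1.1875 lies halfway between 1.125 and 1.25; the even mantissa (1.25) wins.
tie_high = round_to_fp8(1.1875)
```

Within the normal range, the relative error of this rounding stays at or below $2^{-(M+1)}$, which is $2^{-4}$ for E4M3. Passing `man_bits=2, min_exp=-14, max_finite=57344.0` gives the corresponding E5M2 grid.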

Pseudocode for “classical” addition and multiplication:

Addition:

Unpack and align exponents (shift smaller’s mantissa).
Add/subtract mantissas with sign.
Renormalize (one-bit right shift on carry-out, or left shift after cancellation), adjust exponent.
Round to nearest-even, clamp for over/underflow, repack.

Multiplication:

s_out = s_A XOR s_B; e_out = e_A + e_B - bias.
m_out = (1 + m_A/2^M) * (1 + m_B/2^M).
Normalize, round to nearest-even, saturate as needed, repack (Micikevicius et al., 2022, Zhang et al., 2023).
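The addition and multiplication steps above can be sketched on integer significands. This is a simplified software model under stated assumptions, not a description of any particular hardware: it targets E4M3 only, assumes finite (non-NaN) inputs, saturates on overflow, ignores signed-zero corner cases, and forms the exact integer result before rounding instead of tracking guard and sticky bits. All function names are hypothetical.

```python
BIAS, M = 7, 3            # E4M3 parameters from the format table
MAX_FINITE = 0x7E         # largest finite E4M3 code (exponent 1111, mantissa 110)

def unpack(b):
    """Return (sign, exponent, integer significand) of a finite E4M3 byte."""
    s, e, m = b >> 7, (b >> M) & 0xF, b & 0x7
    if e == 0:
        return s, 1, m                  # subnormal: exponent 1, no hidden bit
    return s, e, m | (1 << M)           # normal: restore the hidden bit

def round_pack(s, sig, exp2):
    """Round sig * 2**exp2 to nearest-even E4M3, saturating on overflow."""
    if sig == 0:
        return s << 7
    top = exp2 + sig.bit_length() - 1   # exponent of the leading 1
    q = max(top, 1 - BIAS) - M          # exponent of the last kept bit
    shift = q - exp2
    if shift > 0:
        r, rem, half = sig >> shift, sig & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and r & 1):
            r += 1                      # round up; ties go to even
    else:
        r = sig << -shift               # exact, nothing to round
    if r == 1 << (M + 1):               # rounding carried into the next binade
        r, q = r >> 1, q + 1
    if r < 1 << M:                      # subnormal (or zero) result
        return (s << 7) | r
    e = q + M + BIAS
    if e > 15 or (e == 15 and (r & 0x7) == 0x7):
        return (s << 7) | MAX_FINITE    # saturate instead of producing NaN
    return (s << 7) | (e << M) | (r & 0x7)

def fp8_mul(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    # value = sig * 2**(e - BIAS - M): exponents add, significands multiply.
    return round_pack(sa ^ sb, ma * mb, (ea - BIAS - M) + (eb - BIAS - M))

def fp8_add(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    emin = min(ea, eb)                  # align both operands to the smaller exponent
    total = (-1) ** sa * (ma << (ea - emin)) + (-1) ** sb * (mb << (eb - emin))
    return round_pack(int(total < 0), abs(total), emin - BIAS - M)
```

For instance, `fp8_mul(0x3C, 0x3C)` multiplies 1.5 by 1.5 and returns `0x41` (2.25), while `fp8_add(0x38, 0x18)` adds 1.0 and 0.0625, a tie that rounds back to `0x38` (1.0).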

3. Quantization, Dequantization, and Practical Deployment

FP8 is deployed primarily as a quantization target for full-precision data (commonly FP32 or BF16) in deep neural network weights, activations, and optimizer states (Li et al., 2023, Lee et al., 13 Mar 2025, 2505.20839). The standard approach is affine mapping via a scale $\alpha$:

Quantization: $x_q = \mathrm{round}_{\mathrm{FP8}}(x/\alpha)$ (to nearest-even, saturate/clamp at representable extrema).
Dequantization: $\hat{x} = \alpha \cdot x_q$ (restoring approximate dynamic range).

Scales are typically selected as

$\alpha = \max_i |x_i| \,/\, x_{\max}^{\mathrm{FP8}}$

where the denominator is the format's maximum finite value (e.g., 448 for OCP E4M3, or 240 for E4M3 variants that reserve the top exponent for Inf/NaN) (Choi et al., 28 Oct 2025, Lee et al., 13 Mar 2025). Per-tensor or per-channel scaling is routinely used to minimize clipping/underflow in high dynamic range tensors, with layer-wise (activation) and channel-wise (weight) granularity yielding the best empirical tradeoff for transformers and LLMs (Li et al., 2023, Lee et al., 13 Mar 2025).
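A per-tensor version of this scale, quantize, and dequantize round trip can be sketched as follows. The code is illustrative only: it assumes the scale is the tensor's absolute maximum divided by the FP8 maximum, uses 448 as the E4M3 maximum (a 240-max variant would change one constant), and operates on plain Python lists. The helper names are hypothetical.

```python
import math

E4M3_MAX = 448.0   # assumed OCP E4M3 maximum finite value

def round_e4m3(x):
    """Nearest-even rounding onto the E4M3 grid, saturating at +/-E4M3_MAX."""
    if x == 0.0:
        return 0.0
    a = abs(x)
    exp = max(math.frexp(a)[1] - 1, -6)      # binade, clamped to the subnormal range
    quantum = 2.0 ** (exp - 3)               # grid spacing with 3 mantissa bits
    return math.copysign(min(round(a / quantum) * quantum, E4M3_MAX), x)

def quantize(xs):
    """Per-tensor absmax scaling: returns (FP8-valued list, scale)."""
    amax = max(abs(x) for x in xs)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_e4m3(x / scale) for x in xs], scale

def dequantize(qs, scale):
    """Map FP8 grid values back to the original dynamic range."""
    return [q * scale for q in qs]

xs = [0.001, -0.02, 0.5, 3.0, -1200.0]
qs, scale = quantize(xs)
xhat = dequantize(qs, scale)
```

In this example the single outlier (-1200) sets the scale, so it maps to the format maximum while the smallest entry (0.001) underflows to zero. This is the clipping/underflow tradeoff that finer-grained (per-channel) scales are meant to reduce.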

FP8 quantization is especially robust to outliers due to its logarithmic level spacing, outperforming INT8 and bfloat16 on tasks with heavy-tailed tensor distributions (transformers, residual adds) (Kuzmin et al., 2022, Li et al., 2023, 2505.20839). Quantization-aware retraining is optional but narrows all performance gaps (Kuzmin et al., 2022).

4. Algorithmic Acceleration: DGEMM Emulation via FP8 (Ozaki Schemes)

Emerging high-performance processors exhibit much higher throughput for INT8/FP8 matrix multiply-accumulate (MMA) than for FP64 (Uchino et al., 11 Mar 2026, Mukunoki, 1 Aug 2025). Emulating double-precision (FP64) GEMM (DGEMM) is now possible and highly efficient with FP8 tensor cores using the Ozaki-I and Ozaki-II schemes:

Ozaki-II approach (optimal for FP8): FP64 A, B are scaled to large integers, decomposed into residue classes modulo pairwise coprime moduli $p_1, \dots, p_N$, split into FP8 fragments (via Karatsuba or modular reduction), MMA is performed entirely in low precision (FP8-in, FP32-accumulate, exact for practical problem sizes), and results are recombined using the Chinese Remainder Theorem, followed by inverse scaling. The scheme comes with a guaranteed elementwise error bound.
For DGEMM via the Ozaki-II scheme, the number of required FP8 matrix multiplications is several times smaller than in the best Ozaki-I mapping (e.g., 37 for N=12 moduli versus 121 for S=11 slices), with identical effective bit accuracy (Uchino et al., 11 Mar 2026).
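The residue-and-recombine idea behind the modular scheme can be illustrated with exact integer arithmetic. This is a toy sketch, not the published algorithm: it starts from matrices that are already integers, uses four small moduli chosen only for illustration, stands in Python integers for the FP8 fragments and FP32 accumulation, and omits the FP64 scaling and inverse-scaling steps. All names are hypothetical.

```python
from math import prod

MODULI = [251, 253, 255, 256]     # pairwise coprime; chosen for illustration only

def matmul(a, b):
    """Plain integer matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def sym_mod(x, p):
    """Residue of x modulo p, mapped to the symmetric range around zero."""
    r = x % p
    return r - p if r > p // 2 else r

def crt_matmul(a, b, moduli=MODULI):
    """Integer matmul via small-residue products and CRT recombination.

    Exact as long as every entry of the true product is smaller in
    magnitude than half the product of the moduli.
    """
    big_p = prod(moduli)
    rows, cols = len(a), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for p in moduli:
        ap = [[sym_mod(x, p) for x in row] for row in a]
        bp = [[sym_mod(x, p) for x in row] for row in b]
        cp = matmul(ap, bp)           # small-operand product (low-precision MMA stand-in)
        q = big_p // p
        w = q * pow(q, -1, p)         # CRT weight: 1 mod p, 0 mod every other modulus
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += w * (cp[i][j] % p)
    return [[sym_mod(v % big_p, big_p) for v in row] for row in acc]

a = [[12345, -678], [9, 10000]]
b = [[-4321, 7], [250, -9999]]
c = crt_matmul(a, b)
```

Each per-modulus product only ever sees operands of magnitude at most 128, which is what makes low-precision matrix units usable for this step; the number of moduli then determines how many result bits can be recovered.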

Benchmarks of FP8 Ozaki-II emulation on the NVIDIA B200 indicate that when FP8 hardware throughput is more than twice that of INT8, FP8-based Ozaki-II is optimal (Uchino et al., 11 Mar 2026).

5. Hardware Implementations and Acceleration Pathways

FP8 arithmetic is natively implemented in multiple hardware architectures:

NVIDIA H100, Blackwell, and Rubin tensor cores: E4M3/E5M2 formats with full IEEE or slightly modified handling, fused multiply-add engines accepting FP8 input, accumulating in FP16/FP32, and saturating or trapping on overflow.
Intel Gaudi 2: AVX10.2-compliant OFP8 E4M3/E5M2 pipeline supporting per-tensor/per-channel scale, high-precision accumulation, optimized power-of-two scaling fused into exponent adders for in-core efficiency (Lee et al., 13 Mar 2025).
FPGA and ASIC: Integer operations implement correctly rounded FP8 arithmetic with LUT-based carry-in logic, yielding faster and smaller multipliers than bit-serial designs; E4M3 requires more complex rounding logic than E5M2 but still fits in a per-stage LUT (Lindberg et al., 2024).
Experimental spatial neuromorphic platforms: Bit-exact E4M3 arithmetic is achievable through spatial combinational pipelines of integrate-and-fire neurons, with explicit sticky/guard bit propagation, yielding a latency reduction for large linear layers relative to temporally coded SNNs, and full robustness to analog noise and leakage (Tang, 8 Dec 2025).

Synthesis and validation efforts confirm full alignment with PyTorch/numerical reference for all 16,129 E4M3 value pairs (normal/subnormal/NaN/Inf) (Tang, 8 Dec 2025).

6. Numerical Behavior, Empirical Performance, and Limitations

FP8 enables significant memory footprint reduction (2–4×), bandwidth scaling, and lower energy per operation for 8-bit FMA than for FP16 (Hunhold et al., 29 Apr 2025). Empirical studies demonstrate:

End-to-end FP8 quantization (E4M3/E5M2) matches FP16/BF16 baselines for vision, language, and LLM tasks, typically with only a small accuracy drop (e.g., BERT on GLUE: 84.6% FP32 vs. 84.6% FP8; ResNet-50 on ImageNet: 76.13% FP32 vs. 75.85% FP8) (Li et al., 2023, Huang et al., 2021, Micikevicius et al., 2022).
For FFT-based spectral computations, both E4M3 and E5M2 are unsuitable due to catastrophic overflow in all but extremely low-dynamic-range cases, providing a clear warning of FP8’s limitations outside deep learning (Hunhold et al., 29 Apr 2025).
In LLM fine-tuning and LoRA adapters, FP8 only improves net speed when quantization overhead is amortized (i.e., large GEMMs). For small matrices, quantization setup costs dominate (Choi et al., 28 Oct 2025).
In long-horizon FP8 training (trillion tokens), FP8-specific instabilities are observed in SwiGLU activations, rectified via “Smooth SwiGLU” per-channel scaling. Both moments of Adam can be safely quantized (E4M3/E5M2 respectively) only if appropriate dynamic range is preserved (Fishman et al., 2024).
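The FFT overflow point in the list above can be made concrete with a small worked case. The sketch below is only an illustration and does not reproduce the cited experiments: it uses a naive DFT in double precision, a unit-amplitude constant input, and 448 and 57,344 as assumed maximum finite values for E4M3 and E5M2. For such an input of length n, the zero-frequency bin equals n, so unscaled spectra leave the E4M3 range once n exceeds 448.

```python
import cmath

E4M3_MAX, E5M2_MAX = 448.0, 57344.0   # assumed maximum finite values

def dft(x):
    """Naive O(n^2) discrete Fourier transform in double precision."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

n = 512
spectrum = dft([1.0] * n)                 # constant signal: all energy in bin 0
peak = max(abs(c) for c in spectrum)

overflows_e4m3 = peak > E4M3_MAX          # bin 0 has magnitude n = 512 > 448
overflows_e5m2 = peak > E5M2_MAX          # still in range here, but n > 57,344 would not be
```

Intermediate butterflies grow in the same way as the output bins, so avoiding overflow requires rescaling between stages, which in turn costs precision in formats with only two or three mantissa bits.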

7. Format Parameterization, Flexible Schemes, and Best Practices

FP8 arithmetic enables a spectrum of format choices, trading range for precision, or vice versa, as a function of application layer, distribution tails, and resource constraints (Zhang et al., 2023, Huang et al., 2021, Kuzmin et al., 2022):

Intense dynamic range requirements (gradients, Adam v): E5M2, or formats with an even narrower mantissa and broader exponent.
Most activations, weights: E4M3, E3M4.
Strictly nonnegative ReLU activations: assign the sign bit to exponent/mantissa for even higher density.
Flexible (FFP8) formats permit layerwise (or tensorwise) optimizations of exponent and mantissa allocation, exponent bias, and sign presence, with little top-1 image classification accuracy loss without retraining and only two lightweight bit converters at memory/computation boundaries (Huang et al., 2021).
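The range-versus-precision tradeoff across these format choices can be tabulated from the exponent and mantissa widths alone. The helper below is a hypothetical sketch: it assumes a signed format with the conventional bias $2^{E-1}-1$ (flexible formats may choose a different bias), and its `ieee_specials` flag selects between reserving the whole top exponent for Inf/NaN and reserving only the single all-ones code for NaN. Treating E3M4 like E4M3 in this respect is an assumption made for illustration.

```python
def format_stats(exp_bits, man_bits, ieee_specials):
    """Range and precision figures for a signed 1+E+M floating-point format."""
    bias = (1 << (exp_bits - 1)) - 1          # conventional bias (assumed)
    e_top = (1 << exp_bits) - 1
    if ieee_specials:
        # Top exponent reserved: largest normal uses e_top - 1, full mantissa.
        max_finite = 2.0 ** (e_top - 1 - bias) * (2 - 2.0 ** -man_bits)
    else:
        # Only the all-ones code is NaN: largest mantissa is one step short of full.
        max_finite = 2.0 ** (e_top - bias) * (2 - 2.0 ** (1 - man_bits))
    return {
        "bias": bias,
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        "min_normal": 2.0 ** (1 - bias),
        "max_finite": max_finite,
        "epsilon": 2.0 ** -man_bits,          # spacing of the grid just above 1.0
    }

formats = {
    "E5M2": format_stats(5, 2, ieee_specials=True),
    "E4M3": format_stats(4, 3, ieee_specials=False),
    "E3M4": format_stats(3, 4, ieee_specials=False),   # special handling assumed
}
```

Moving one bit from the exponent to the mantissa halves `epsilon` but shrinks the ratio of `max_finite` to `min_normal` by many binades, which is the tradeoff the per-layer format choices above are navigating.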

Best practices include always using per-tensor or per-channel affine scalings, nearest-even rounding, maintaining high-precision accumulation in MACs, and benchmarking format tradeoffs on calibration data. For safety, non-GEMM (e.g., LayerNorm, GELU) ops are best kept in FP16/BF16 (Li et al., 2023, Micikevicius et al., 2022).

FP8 arithmetic has thus catalyzed a new design space for both AI and scientific hardware, facilitating deep compression, throughput scaling, and algorithm/hardware co-design. Its main limitations remain restricted precision for scientific reductions (without emulation) and catastrophic overflow for high-dynamic-range non-learned signals. Within its representable range and with careful quantization, FP8 now stands as the default low-precision arithmetic for inference/training acceleration in large language and vision models, and is the enabling substrate for high-throughput FP64 emulation in emergent GPU and NPU architectures (Uchino et al., 11 Mar 2026, Li et al., 2023, Lee et al., 13 Mar 2025, Choi et al., 28 Oct 2025, Huang et al., 2021, Lindberg et al., 2024, Tang, 8 Dec 2025).