
FP8 Calculations: Formats, Quantization & Architectures

Updated 4 June 2026
  • FP8 calculations are low-precision, 8-bit floating-point operations that partition a byte into sign, exponent, and significand to represent real values.
  • They employ quantization strategies, including group-wise and dynamic scaling, to align the tensor range with FP8 limits in deep learning and HPC applications.
  • Hardware implementations use mixed-precision and integer-based techniques to accelerate computation, balancing throughput gains against precision and efficiency tradeoffs.

An 8-bit floating-point (FP8) format refers to a family of low-precision, IEEE-inspired number representations in which a single byte is divided among a sign bit, exponent field, and significand (mantissa) field. Recent advances in both hardware and software have made FP8 arithmetic highly relevant for efficient deep learning training and inference, high-performance computing, and edge deployment. The FP8 calculation ecosystem now encompasses multiple formats, quantization and scaling strategies, hardware-accelerated arithmetic, mixed-precision kernels, and complete end-to-end workflows.

1. FP8 Format Definitions and Numerical Properties

The canonical FP8 format is defined as follows: a single 8-bit word comprises 1 sign bit, $E$ exponent bits, and $M = 7 - E$ mantissa (fraction) bits, with an exponent bias $B = 2^{E-1} - 1$ (Micikevicius et al., 2022, Kim et al., 3 Feb 2025, Baalen et al., 2023).

The real value encoded by an FP8 bit pattern $[s\,|\,e_{E-1}\ldots e_0\,|\,m_{M-1}\ldots m_0]$, with $e$ and $m$ read as unsigned integers, is:

  • For normal values ($1 \leq e \leq 2^E - 2$):

$$f = (-1)^s \cdot 2^{e - B} \cdot \left(1 + \frac{m}{2^M}\right)$$

  • For subnormals ($e = 0$):

$$f = (-1)^s \cdot 2^{1-B} \cdot \frac{m}{2^M}$$

  • $e = 2^E - 1$ is used for special values (NaN, $\pm\infty$).
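The decoding rules above can be sketched in Python as follows. This is a minimal illustration that assumes the generic IEEE-style layout described in this section, with the all-ones exponent reserved for infinities and NaN. The function name `decode_fp8` is hypothetical, and hardware E4M3 variants that reuse those patterns for extra normals are not modeled.

```python
import math

def decode_fp8(byte: int, E: int) -> float:
    """Decode an 8-bit pattern with 1 sign bit, E exponent bits, M = 7 - E mantissa bits."""
    M = 7 - E
    B = 2 ** (E - 1) - 1                 # exponent bias
    s = (byte >> 7) & 1                  # sign bit
    e = (byte >> M) & ((1 << E) - 1)     # biased exponent field
    m = byte & ((1 << M) - 1)            # mantissa field
    sign = -1.0 if s else 1.0
    if e == (1 << E) - 1:                # all-ones exponent: special values
        return sign * math.inf if m == 0 else math.nan
    if e == 0:                           # subnormal: no implicit leading 1
        return sign * 2.0 ** (1 - B) * (m / 2 ** M)
    return sign * 2.0 ** (e - B) * (1 + m / 2 ** M)

print(decode_fp8(0x38, E=4))   # E4M3 pattern 0|0111|000
print(decode_fp8(0x01, E=4))   # smallest positive E4M3 subnormal
print(decode_fp8(0x7B, E=5))   # largest finite E5M2 pattern 0|11110|11
```

Passing `E` as a parameter keeps the sketch format-agnostic, so the same function covers E4M3, E5M2, and other splits of the seven non-sign bits.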

Two widely adopted FP8 standards are E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits) (Micikevicius et al., 2022, Kuzmin et al., 2022, Kim et al., 3 Feb 2025):

| Format | Exponent Bits (E) | Mantissa Bits (M) | Bias | Min Subnormal | Min Normal | Max Normal |
|--------|-------------------|-------------------|------|---------------|------------|------------|
| E4M3   | 4                 | 3                 | 7    | $2^{-9}$      | $2^{-6}$   | 448        |
| E5M2   | 5                 | 2                 | 15   | $2^{-16}$     | $2^{-14}$  | 57344      |

Machine epsilon, the gap between 1 and the next representable value, is $\epsilon = 2^{-M}$; the worst-case relative rounding error under round-to-nearest is half of that, $2^{-(M+1)}$. Thus, E4M3: $\epsilon = 2^{-3} = 0.125$; E5M2: $\epsilon = 2^{-2} = 0.25$ (Micikevicius et al., 2022, Shen et al., 2023). Some hardware implements full IEEE compliance for E5M2; E4M3 often omits a separate $\pm\infty$ encoding and uses the extra bit patterns for extended normals, which is how its maximum reaches 448 rather than the 240 that the generic formula gives (Micikevicius et al., 2022).
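The table entries can be rederived from the format parameters. The sketch below is illustrative only: `fp8_limits` and its `ieee_specials` flag are hypothetical names, epsilon is taken as $2^{-M}$ (the gap between 1 and the next value), and the non-IEEE branch assumes that only the all-ones exponent with all-ones mantissa is reserved for NaN.

```python
def fp8_limits(E: int, ieee_specials: bool = True) -> dict:
    """Derive range and precision figures for a 1 + E + M bit format with M = 7 - E."""
    M = 7 - E
    B = 2 ** (E - 1) - 1
    if ieee_specials:
        # all-ones exponent is reserved, so the top normal uses e = 2^E - 2
        e_max, m_max = 2 ** E - 2, 2 ** M - 1
    else:
        # assumption: only (all-ones exponent, all-ones mantissa) encodes NaN,
        # so e = 2^E - 1 still holds normals up to m = 2^M - 2
        e_max, m_max = 2 ** E - 1, 2 ** M - 2
    return {
        "bias": B,
        "min_subnormal": 2.0 ** (1 - B - M),
        "min_normal": 2.0 ** (1 - B),
        "max_normal": 2.0 ** (e_max - B) * (1 + m_max / 2 ** M),
        "epsilon": 2.0 ** (-M),
    }

print(fp8_limits(4, ieee_specials=False))  # E4M3 with extended normals
print(fp8_limits(5, ieee_specials=True))   # E5M2 with IEEE-style specials
```

Calling `fp8_limits(4, ieee_specials=True)` instead shows the smaller maximum that E4M3 would have if it reserved a full exponent code for special values.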

2. Quantization, Scaling, and Conversion Pipelines

FP8 quantization relies on matching tensor dynamic range to the representable FP8 range and, when necessary, locally adjusting scale factors (Kim et al., 3 Feb 2025, Shen et al., 2023, Wang et al., 4 Nov 2025). The basic quantization (applied per-tensor, per-row, per-group, or per-channel) is:

  • Compute a scaling factor $\alpha = \max_i |x_i| / f_{\max}$, where $f_{\max}$ is the largest normal FP8 value for the format in use.
  • Quantize: $x_q = \mathrm{round}_{\mathrm{FP8}}(x / \alpha)$.
  • Dequantize: $\hat{x} = \alpha \cdot x_q$.
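The three steps can be sketched for a single tensor in plain Python. This is a simulation under stated assumptions, not a library API: the scale is taken as the absolute maximum divided by the largest FP8 value, `fp8_round` is a hypothetical helper that emulates E4M3 rounding (ties to even, saturating at 448), and Python floats stand in for the higher-precision source data.

```python
import math

def fp8_round(x: float, M: int = 3, B: int = 7, fmax: float = 448.0) -> float:
    """Round x to the nearest value on an FP8 grid (ties to even), saturating at fmax."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), fmax)            # saturate instead of overflowing
    _, ex = math.frexp(a)            # a = frac * 2**ex with 0.5 <= frac < 1
    e = max(ex - 1, 1 - B)           # unbiased exponent, clamped for subnormals
    step = 2.0 ** (e - M)            # grid spacing in this binade
    return sign * min(round(a / step) * step, fmax)   # round() is ties-to-even

def quantize(xs, fmax: float = 448.0):
    """Per-tensor scaling: map the largest magnitude onto fmax, then round to FP8."""
    amax = max(abs(v) for v in xs)
    scale = amax / fmax if amax > 0 else 1.0
    return [fp8_round(v / scale) for v in xs], scale

def dequantize(q, scale: float):
    """Undo the scaling; the rounding error introduced by fp8_round remains."""
    return [v * scale for v in q]

x = [0.1, -2.0, 1000.0]
q, scale = quantize(x)
x_hat = dequantize(q, scale)
print(q, scale, x_hat)
```

The same functions apply per row, per group, or per channel by calling `quantize` on each slice and keeping one scale per slice.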

Post-training quantization (PTQ) computes these scales using a small calibration set and experimentally observed maxima; quantization-aware training (QAT) may allow the scale (and, in some cases, the effective mantissa bits) to be learned during optimization, leveraging straight-through estimators to enable gradient flow (Kuzmin et al., 2022, Shen et al., 2023).

Dynamic or group-wise scaling, as employed in frameworks like COAT, further matches FP8's dynamic range to the tensor (Xi et al., 2024). In favorable cases, “unit scaling” exploits architectural invariance to select fixed scales (e.g., […] per layer) (Narayan et al., 9 Feb 2025).
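A small numeric illustration of why finer-grained scales help: with a single scale, one outlier can push every small entry below the point where E4M3 rounds to zero. The sketch below only counts such entries. It assumes the scale is the group maximum divided by 448, and the names `flushed_to_zero`, `FMAX`, and `FLUSH` are hypothetical.

```python
FMAX = 448.0          # E4M3 max normal
FLUSH = 2.0 ** -10    # half of the E4M3 min subnormal 2**-9; at or below this, ties-to-even gives 0

def flushed_to_zero(values, group_size: int) -> int:
    """Count nonzero inputs that would round to zero after per-group scaling."""
    lost = 0
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / FMAX if amax > 0 else 1.0
        lost += sum(1 for v in group if v != 0 and abs(v) / scale <= FLUSH)
    return lost

x = [1e-3, 2e-3, -1.5e-3, 3e-3, 0.5, -0.25, 1e6, 0.75]
per_tensor = flushed_to_zero(x, group_size=len(x))   # one scale for the whole tensor
per_group = flushed_to_zero(x, group_size=4)         # one scale per group of 4
print(per_tensor, per_group)
```

With these inputs the single-scale case flushes seven of the eight entries, while the grouped case flushes three: the group that contains the outlier still loses its small entries, which is why outlier handling is treated separately from scaling granularity.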

3. FP8 Arithmetic and Kernel Implementations

FP8 multiply-accumulate (MAC) and matrix-matrix multiply (GEMM) implementations typically cast operands to FP8, execute in higher-precision accumulators (FP16, BF16, or FP32), and, if required, cast the result back to FP8 (Jarmusch et al., 10 Feb 2026, Hernández-Cano et al., 26 May 2025, Baalen et al., 2023). The conversion to FP8 uses round-to-nearest-even; on overflow, values either saturate to the largest finite magnitude or map to a special value (NaN or $\pm\infty$), depending on the format and conversion mode. Intensive workflows (e.g., in LLMs or MoE models) employ blockwise or tilewise quantization to maximize hardware occupancy and minimize double-quantization error (Wang et al., 4 Nov 2025).
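The reason for the higher-precision accumulator can be seen in a scalar dot product. The sketch below is a simulation under assumptions: `fp8_round` is a hypothetical helper emulating E4M3 rounding, and a Python float stands in for an FP32 accumulator. It compares a wide accumulator with one that is rounded back to FP8 after every addition.

```python
import math

def fp8_round(x: float, M: int = 3, B: int = 7, fmax: float = 448.0) -> float:
    """Round x to the nearest E4M3-style grid value (ties to even), saturating at fmax."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), fmax)
    _, ex = math.frexp(a)            # a = frac * 2**ex with 0.5 <= frac < 1
    e = max(ex - 1, 1 - B)           # clamp exponent for the subnormal range
    step = 2.0 ** (e - M)
    return sign * min(round(a / step) * step, fmax)

def dot_fp8_inputs(a, b, fp8_accumulator: bool) -> float:
    """Dot product with FP8-cast operands and either a wide or an FP8 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += fp8_round(x) * fp8_round(y)   # operands cast to FP8 before multiplying
        if fp8_accumulator:
            acc = fp8_round(acc)             # re-round the running sum at every step
    return acc

ones = [1.0] * 64
wide = dot_fp8_inputs(ones, ones, fp8_accumulator=False)
narrow = dot_fp8_inputs(ones, ones, fp8_accumulator=True)
print(wide, narrow)
```

The FP8 accumulator stalls at 16: the grid spacing there is 2, so 16 + 1 lands on a tie that rounds back to 16, while the wide accumulator reaches the exact sum of 64.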

FP8 arithmetic can also be implemented directly with pure integer logic (integer-based add, shift, multiply), significantly reducing silicon area and critical path on FPGAs or ASICs (Lindberg et al., 2024). Some neuromorphic approaches achieve bit-exact FP8 arithmetic by mapping arithmetic and rounding to threshold logic circuits in spatial combinational pipelines (Tang, 8 Dec 2025).
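As a rough illustration of the integer-only idea, the following sketch multiplies two E4M3 bit patterns using only integer add, shift, and multiply, with round-to-nearest-even. It is not the design from the cited work. The function name `fp8_mul_e4m3` is hypothetical, and the sketch deliberately restricts inputs and results to exponent fields 1 through 14, so subnormals, overflow, and special values are rejected rather than handled.

```python
def fp8_mul_e4m3(a: int, b: int) -> int:
    """Multiply two E4M3 bytes with integer ops only (exponent fields 1..14)."""
    M_BITS, BIAS = 3, 7
    sa, ea, ma = a >> 7, (a >> M_BITS) & 0xF, a & 0x7
    sb, eb, mb = b >> 7, (b >> M_BITS) & 0xF, b & 0x7
    if not (1 <= ea <= 14 and 1 <= eb <= 14):
        raise ValueError("sketch handles exponent fields 1..14 only")
    sign = sa ^ sb
    p = (8 + ma) * (8 + mb)      # 4-bit significands with implicit 1; p in [64, 225]
    e = ea + eb - BIAS           # biased exponent of the product
    shift = 3
    if p >= 128:                 # product significand in [2, 4): renormalize
        shift, e = 4, e + 1
    q = p >> shift               # truncated 4-bit significand
    rem = p & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if rem > half or (rem == half and (q & 1)):   # round to nearest, ties to even
        q += 1
    if q == 16:                  # rounding carried out of the significand
        q, e = 8, e + 1
    if not 1 <= e <= 14:
        raise ValueError("result outside the range handled by this sketch")
    return (sign << 7) | (e << M_BITS) | (q - 8)

print(hex(fp8_mul_e4m3(0x3C, 0x3C)))   # 1.5 * 1.5
print(hex(fp8_mul_e4m3(0x39, 0x3C)))   # 1.125 * 1.5, an exact tie
```

In the second call the exact product 1.6875 sits halfway between 1.625 and 1.75, and the tie resolves to the pattern with an even mantissa field.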

In high-performance computing, FP64 computations can be emulated using FP8 Tensor Cores via the Ozaki scheme—splitting operands into precisely re-scaled components to realize error-free transformations, followed by FP8 GEMMs and reconstructing the result with higher-precision accumulation (Mukunoki, 1 Aug 2025, Uchino et al., 11 Mar 2026).
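The slicing idea behind such schemes can be shown on scalars. The toy below is not the cited algorithms: it ignores FP8's exponent range (real schemes rescale the slices), works on single numbers rather than matrices, and uses `Fraction` as a stand-in for error-free higher-precision accumulation. The function names are hypothetical. It splits each operand into pieces with at most 4 significant bits, the significand width of E4M3, so that every piece-by-piece product is exact.

```python
import math
from fractions import Fraction

def split_slices(x: float, bits: int = 4):
    """Split x into pieces with at most `bits` significant bits; the pieces sum to x exactly."""
    slices, r = [], x
    while r != 0.0:
        _, ex = math.frexp(r)             # |r| in [2**(ex-1), 2**ex)
        step = 2.0 ** (ex - bits)
        s = math.trunc(r / step) * step   # keep the top `bits` bits of the remainder
        slices.append(s)
        r -= s                            # exact: strips the leading bits of r
    return slices

def sliced_product(x: float, y: float) -> Fraction:
    """Rebuild x * y from low-precision slice products accumulated without error."""
    xs, ys = split_slices(x), split_slices(y)
    # each slice product needs at most 8 significant bits, so it carries no rounding error
    return sum((Fraction(a) * Fraction(b) for a in xs for b in ys), Fraction(0))

x, y = 0.1, 0.3
exact = Fraction(x) * Fraction(y)
print(len(split_slices(x)), sliced_product(x, y) == exact)
```

The number of slice products grows with the square of the slice count, which is the cost such schemes trade against the much higher throughput of low-precision matrix units.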

4. Error Analysis, Precision/Range Tradeoffs, and Suitability

The primary mathematical tradeoff for FP8 is between dynamic range (exponent bits) and precision (mantissa bits) (Kuzmin et al., 2022, Micikevicius et al., 2022, Shen et al., 2023). E4M3 carries one more mantissa bit and so gives finer relative precision within a narrower range, which suits weights and activations with low variance, while E5M2 (and even higher-exponent formats) covers a wider dynamic range, which suits gradients, optimizer states, or outlier-plagued activations. For distributions with heavy tails (such as those in transformer activations), increasing exponent bits lowers MSE, so network architecture and data distribution should govern format selection (Kuzmin et al., 2022).
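The tradeoff shows up directly when the same values are rounded in both formats. The sketch below is a simulation under assumptions: `fp8_round` is a hypothetical helper with round-to-nearest-even and saturation, and the format parameters are the bias and maximum values from Section 1.

```python
import math

def fp8_round(x: float, M: int, B: int, fmax: float) -> float:
    """Round x to the nearest value of a (M mantissa bits, bias B) format, saturating at fmax."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), fmax)
    _, ex = math.frexp(a)            # a = frac * 2**ex with 0.5 <= frac < 1
    e = max(ex - 1, 1 - B)           # clamp exponent for the subnormal range
    step = 2.0 ** (e - M)
    return sign * min(round(a / step) * step, fmax)

E4M3 = dict(M=3, B=7, fmax=448.0)
E5M2 = dict(M=2, B=15, fmax=57344.0)

for v in (0.28, 1000.0):
    print(v, fp8_round(v, **E4M3), fp8_round(v, **E5M2))
```

For the small value, E4M3 lands on 0.28125 while E5M2 can only offer 0.25. For the large value, E4M3 saturates at 448 while E5M2 represents it as 1024.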

Empirical studies across 75 architectures show FP8 PTQ outperforms INT8 in quantization error and end-to-end accuracy, with E4M3 working best for NLP, E3M4 for CV (Shen et al., 2023).

5. Hardware Acceleration and Execution Characteristics

Recent accelerators (NVIDIA Hopper/H100, Intel Gaudi 2, AMD MI300A) natively support both E4M3 and E5M2 FP8 kernels (Jarmusch et al., 10 Feb 2026, Kim et al., 3 Feb 2025). These devices achieve up to 2× throughput/TFLOPS and 1.8× power efficiency compared to FP16, but actual gains are limited by occupancy, memory bandwidth, and kernel tiling strategies. On AMD MI300A, FP8 MFMA instructions with FP32 accumulation are available; maximum throughput is reached when large numbers (≥256) of active wavefronts are sustained (Jarmusch et al., 10 Feb 2026). For small batch sizes or "thin" GEMMs (common in decode-stage LLM inference), achievable FP8 performance is often less than hardware peak.
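A back-of-the-envelope way to see why thin GEMMs fall short of peak is arithmetic intensity, the ratio of floating-point operations to bytes moved. The sketch below assumes 1-byte FP8 inputs, a 2-byte output, and that each operand is read from memory exactly once. The function name `gemm_intensity` and the shapes are illustrative, and no specific device figures are implied.

```python
def gemm_intensity(m: int, n: int, k: int, in_bytes: int = 1, out_bytes: int = 2) -> float:
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n], ignoring cache reuse effects."""
    flops = 2 * m * n * k                                     # one multiply and one add per MAC
    traffic = in_bytes * (m * k + k * n) + out_bytes * m * n  # read A and B once, write C once
    return flops / traffic

thin = gemm_intensity(1, 4096, 4096)         # decode-style "thin" GEMM
square = gemm_intensity(4096, 4096, 4096)    # large square GEMM
print(thin, square)
```

The thin case performs roughly two operations per byte, about a thousand times fewer than the square case, so its speed is set by memory bandwidth rather than by FP8 compute throughput.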

FP8 hardware, however, can be 50–180% less efficient than INT8 in terms of pure compute throughput, especially for inference; thus, INT8 remains preferable for edge-centric inference deployments (Baalen et al., 2023). On the other hand, FP8 offers critical advantages for training and high-dynamic-range workloads.

6. Post-training Quantization, Outlier Handling, and Hybrid Kernels

FP8 quantization workflows, as validated in FireQ, integrate outlier smoothing, channel-wise scaling, and RoPE-aware normalization to maintain accuracy under aggressive quantization—crucial for LLMs with rotary positional embeddings (2505.20839). Mixed-precision kernels, e.g., INT4 weights with FP8 activations, can further optimize bandwidth and performance.

Advanced methods, such as dynamic range expansion via companding (COAT) or mixed-precision quantization by per-tensor/activation regime, substantially reduce quantization-induced error while enabling end-to-end FP8 computation (including optimizer states and large layer activations) (Xi et al., 2024).

7. Applications, Conversion, and Limitations

FP8 arithmetic is now widespread in LLM training (matching or exceeding BF16 in speed and downstream accuracy at scale), federated learning (offering 2.9× communication savings over FP32), and scientific computing (enabling 8–53× acceleration for FP64 emulation) (Xi et al., 2024, Wang et al., 2024, Uchino et al., 11 Mar 2026). When converting FP8-trained networks to INT8 for inference, post-training quantization without retraining is feasible, but can incur 50–180% compute efficiency loss unless the workload is specifically optimized for integer ops (Baalen et al., 2023).

FP8 formats are not universally superior: for latency-sensitive inference, low-occupancy kernels, or edge inference with INT-only hardware, INT8 remains preferred. Moreover, relative errors (machine epsilon) are substantially higher than for FP16/BF16, which places inherent accuracy limits on FP8 for outlier-prone or numerically unstable models (Baalen et al., 2023, Micikevicius et al., 2022).


