FP8 Calculations: Formats, Quantization & Architectures
- FP8 calculations are low-precision, 8-bit floating-point operations that partition a byte into sign, exponent, and significand to represent real values.
- They employ quantization strategies, including group-wise and dynamic scaling, to align the tensor range with FP8 limits in deep learning and HPC applications.
- Hardware implementations use mixed-precision and integer-based techniques to accelerate computation, balancing throughput gains against precision and efficiency tradeoffs.
An 8-bit floating-point (FP8) format refers to a family of low-precision, IEEE-inspired number representations in which a single byte is divided among a sign bit, exponent field, and significand (mantissa) field. Recent advances in both hardware and software have made FP8 arithmetic highly relevant for efficient deep learning training and inference, high-performance computing, and edge deployment. The FP8 calculation ecosystem now encompasses multiple formats, quantization and scaling strategies, hardware-accelerated arithmetic, mixed-precision kernels, and complete end-to-end workflows.
1. FP8 Format Definitions and Numerical Properties
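The canonical FP8 format is defined as follows: a single 8-bit word comprises 1 sign bit $s$, $E$ exponent bits, and $M = 7 - E$ mantissa (fraction) bits, with an exponent bias $b = 2^{E-1} - 1$ (Micikevicius et al., 2022, Kim et al., 3 Feb 2025, Baalen et al., 2023).
The real value encoded by an FP8 bit pattern with stored exponent field $e$ and stored mantissa field $m$ (both read as unsigned integers) is:
- For normal values ($1 \le e \le 2^E - 2$): $x = (-1)^s \cdot 2^{e-b} \cdot (1 + m/2^M)$
- For subnormals ($e = 0$): $x = (-1)^s \cdot 2^{1-b} \cdot (m/2^M)$
- $e = 2^E - 1$ is used for special values (NaN, $\pm\infty$).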
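The decoding rules above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: the function name `decode_fp8` and the `ieee_specials` flag are hypothetical, and the non-IEEE branch assumes the common E4M3 convention in which only the all-ones exponent with all-ones mantissa encodes NaN.

```python
def decode_fp8(byte, exp_bits, man_bits, ieee_specials=True):
    """Decode one FP8 bit pattern (0..255) into a Python float."""
    assert 0 <= byte <= 0xFF and 1 + exp_bits + man_bits == 8
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if byte >> 7 else 1.0
    e = (byte >> man_bits) & ((1 << exp_bits) - 1)   # stored exponent field
    m = byte & ((1 << man_bits) - 1)                 # stored mantissa field
    all_ones = (1 << exp_bits) - 1

    if e == all_ones:
        if ieee_specials:
            # IEEE-style: top exponent code is reserved for inf and NaN.
            return sign * float("inf") if m == 0 else float("nan")
        # Assumed E4M3-style variant: only the all-ones mantissa is NaN,
        # every other pattern in the top exponent code is a normal value.
        if m == (1 << man_bits) - 1:
            return float("nan")
    if e == 0:
        # Subnormal: no implicit leading one, fixed exponent 1 - bias.
        return sign * 2.0 ** (1 - bias) * (m / 2 ** man_bits)
    # Normal: implicit leading one.
    return sign * 2.0 ** (e - bias) * (1 + m / 2 ** man_bits)


print(decode_fp8(0x7E, 4, 3, ieee_specials=False))  # largest finite E4M3 value
print(decode_fp8(0x7B, 5, 2))                       # largest finite E5M2 value
```

Looping `decode_fp8` over all 256 bit patterns is a convenient way to enumerate the full value grid of either format.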
Two widely adopted FP8 standards are E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits) (Micikevicius et al., 2022, Kuzmin et al., 2022, Kim et al., 3 Feb 2025):
| Format | Exponent Bits (E) | Mantissa Bits (M) | Bias | Min Subnormal | Min Normal | Max Normal |
|---|---|---|---|---|---|---|
| E4M3 | 4 | 3 | 7 | $2^{-9}$ | $2^{-6}$ | 448 |
| E5M2 | 5 | 2 | 15 | $2^{-16}$ | $2^{-14}$ | 57344 |
Machine epsilon (the spacing between 1 and the next representable value, which bounds the relative rounding error) is $2^{-M}$. Thus, E4M3: $2^{-3}$; E5M2: $2^{-2}$ (Micikevicius et al., 2022, Shen et al., 2023). Some hardware implements full IEEE compliance for E5M2; E4M3 often omits a separate $\pm\infty$ encoding, using the freed bit patterns for extended normals (Micikevicius et al., 2022).
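The table entries follow directly from the exponent and mantissa widths. The sketch below derives them; the helper name `fp8_limits` and the `ieee_specials` flag are hypothetical, and the non-IEEE branch assumes that only the all-ones mantissa in the top exponent code is reserved (for NaN), which is the convention that yields 448 for E4M3.

```python
def fp8_limits(exp_bits, man_bits, ieee_specials=True):
    """Derive range and precision limits from the field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_specials:
        # Top exponent code reserved for inf/NaN; full mantissa usable below it.
        max_exp_field = 2 ** exp_bits - 2
        max_significand = 2 - 2.0 ** -man_bits
    else:
        # Assumed E4M3-style: top exponent code usable, all-ones mantissa is NaN.
        max_exp_field = 2 ** exp_bits - 1
        max_significand = 2 - 2.0 ** (1 - man_bits)
    return {
        "min_subnormal": 2.0 ** (1 - bias - man_bits),
        "min_normal": 2.0 ** (1 - bias),
        "max_normal": 2.0 ** (max_exp_field - bias) * max_significand,
        "epsilon": 2.0 ** -man_bits,
    }


print(fp8_limits(4, 3, ieee_specials=False))  # E4M3
print(fp8_limits(5, 2, ieee_specials=True))   # E5M2
```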
2. Quantization, Scaling, and Conversion Pipelines
FP8 quantization relies on matching tensor dynamic range to the representable FP8 range and, when necessary, locally adjusting scale factors (Kim et al., 3 Feb 2025, Shen et al., 2023, Wang et al., 4 Nov 2025). The basic quantization (applied per-tensor, per-row, per-group, or per-channel) is:
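- Compute a scaling factor $\alpha = \max_i |x_i| / F_{\max}$, where $F_{\max}$ is the largest normal FP8 value for the format in use.
- Quantize: $x_q = \mathrm{cast}_{\mathrm{FP8}}(x / \alpha)$.
- Dequantize: $\hat{x} = \alpha \cdot x_q$.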
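A minimal per-tensor version of these three steps is sketched below in pure Python. It assumes the convention scale = amax / F_max, with quantization dividing by the scale and dequantization multiplying by it. The helpers `round_to_fp8`, `quantize`, and `dequantize` are hypothetical names that simulate the FP8 cast on ordinary floats rather than calling a real FP8 library.

```python
import math


def round_to_fp8(x, exp_bits=4, man_bits=3, max_normal=448.0):
    """Round x to the nearest FP8 grid point (ties to even), saturating."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_normal)          # saturate instead of overflowing
    _, ex = math.frexp(a)                # a lies in [2**(ex-1), 2**ex)
    e = max(ex - 1, 1 - bias)            # clamp to the subnormal binade
    step = 2.0 ** (e - man_bits)         # grid spacing in that binade
    return sign * round(a / step) * step # Python's round() is ties-to-even


def quantize(values, max_normal=448.0):
    """Per-tensor scaling: map the largest magnitude onto max_normal."""
    amax = max(abs(v) for v in values)
    scale = amax / max_normal if amax > 0 else 1.0
    return [round_to_fp8(v / scale, max_normal=max_normal) for v in values], scale


def dequantize(q, scale):
    return [v * scale for v in q]


x = [0.02, -1.5, 3.0, 7.0]
q, scale = quantize(x)
x_hat = dequantize(q, scale)
print(scale, q, x_hat)
```

In this example the three larger values survive the round trip exactly, while 0.02 picks up a small relative error because it falls between two E4M3 grid points after scaling.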
Post-training quantization (PTQ) computes these scales using a small calibration set and experimentally observed maxima; quantization-aware training (QAT) may allow the scale (and, in some cases, the effective mantissa bits) to be learned during optimization, leveraging straight-through estimators to enable gradient flow (Kuzmin et al., 2022, Shen et al., 2023).
Dynamic or group-wise scaling, as employed in frameworks like COAT, further matches FP8's dynamic range to the tensor (Xi et al., 2024). In optimal cases, “unit scaling” exploits architectural invariance to select fixed scales (e.g., $1/\sqrt{\text{fan-in}}$ per layer) (Narayan et al., 9 Feb 2025).
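To illustrate why finer-grained scales help, the sketch below compares one scale for a whole tensor against one scale per group of four elements. It is a toy simulation under stated assumptions: `round_to_fp8` and `fake_quantize` are hypothetical helpers that emulate an E4M3 cast on Python floats, and the group size and data are made up for the example. It is not COAT's actual algorithm.

```python
import math


def round_to_fp8(x, exp_bits=4, man_bits=3, max_normal=448.0):
    """Round x to the nearest FP8 grid point (ties to even), saturating."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_normal)
    _, ex = math.frexp(a)
    e = max(ex - 1, 1 - bias)
    step = 2.0 ** (e - man_bits)
    return sign * round(a / step) * step


def fake_quantize(values, group_size, max_normal=448.0):
    """Quantize then dequantize, using one scale per group of elements."""
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / max_normal if amax > 0 else 1.0
        out.extend(round_to_fp8(v / scale) * scale for v in group)
    return out


# One large outlier sits in the second group of four.
x = [0.001, -0.002, 0.003, 0.0015, 3000.0, 0.002, -0.001, 0.004]
per_tensor = fake_quantize(x, group_size=len(x))
per_group = fake_quantize(x, group_size=4)
print(per_tensor[:4])  # small values are flushed to zero by the shared scale
print(per_group[:4])   # small values survive in the outlier-free group
```

The small values that share a group with the outlier are still lost, which is one motivation for the outlier-handling techniques discussed later.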
3. FP8 Arithmetic and Kernel Implementations
FP8 multiply-accumulate (MAC) and matrix-matrix multiply (GEMM) implementations typically cast operands to FP8, execute in higher-precision accumulators (FP16, BF16, or FP32), and, if required, cast the result back to FP8 (Jarmusch et al., 10 Feb 2026, Hernández-Cano et al., 26 May 2025, Baalen et al., 2023). The conversion to FP8 uses round-to-nearest-even; on overflow, values either saturate to the largest finite magnitude or map to a special value (infinity or NaN), depending on the format and conversion mode. Intensive workflows (e.g., in LLMs or MoE models) employ blockwise or tilewise quantization to maximize hardware occupancy and minimize double-quantization error (Wang et al., 4 Nov 2025).
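The role of the higher-precision accumulator can be seen in a small simulation. The sketch below assumes E4M3 operands and uses a Python float (FP64) as a stand-in for an FP32 accumulator; `round_to_fp8` and `dot_fp8` are hypothetical helper names, and the example is meant only to show the stagnation effect, not to model any particular hardware.

```python
import math


def round_to_fp8(x, exp_bits=4, man_bits=3, max_normal=448.0):
    """Round x to the nearest FP8 grid point (ties to even), saturating."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_normal)
    _, ex = math.frexp(a)
    e = max(ex - 1, 1 - bias)
    step = 2.0 ** (e - man_bits)
    return sign * round(a / step) * step


def dot_fp8(a, b, accumulate_in_fp8=False):
    """Dot product of FP8-cast operands with a wide or an FP8 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += round_to_fp8(x) * round_to_fp8(y)  # product is exact in a float
        if accumulate_in_fp8:
            acc = round_to_fp8(acc)               # re-round the running sum
    return acc


ones = [1.0] * 64
print(dot_fp8(ones, ones))                          # wide accumulator
print(dot_fp8(ones, ones, accumulate_in_fp8=True))  # FP8 accumulator stalls
```

With an FP8 accumulator the running sum stops growing once the grid spacing exceeds the size of each new term, which is why kernels keep partial sums in a wider format.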
FP8 arithmetic can also be implemented directly with pure integer logic (integer-based add, shift, multiply), significantly reducing silicon area and critical path on FPGAs or ASICs (Lindberg et al., 2024). Some neuromorphic approaches achieve bit-exact FP8 arithmetic by mapping arithmetic and rounding to threshold logic circuits in spatial combinational pipelines (Tang, 8 Dec 2025).
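One well-known integer-only technique is to treat the bit pattern as a scaled logarithm, so that multiplication becomes integer addition of the encodings minus the bias offset. The sketch below shows this for E4M3; it is a generic Mitchell-style approximation chosen for illustration and is not claimed to be the construction used in the cited work. The function names are hypothetical, and zero, subnormal, and NaN inputs are deliberately not handled.

```python
BIAS, MAN_BITS = 7, 3  # E4M3 parameters


def decode_e4m3(bits):
    """Decode an E4M3 pattern (finite values only) for display purposes."""
    sign = -1.0 if bits & 0x80 else 1.0
    e = (bits >> MAN_BITS) & 0xF
    m = bits & 0x7
    if e == 0:
        return sign * 2.0 ** (1 - BIAS) * (m / 8)
    return sign * 2.0 ** (e - BIAS) * (1 + m / 8)


def approx_mul_bits(a_bits, b_bits):
    """Approximate product of two normal E4M3 values using integer ops only."""
    sign = (a_bits ^ b_bits) & 0x80
    # Adding encodings adds exponents and (approximately) log-mantissas;
    # one copy of the bias must be subtracted again.
    mag = (a_bits & 0x7F) + (b_bits & 0x7F) - (BIAS << MAN_BITS)
    mag = max(0, min(mag, 0x7E))  # clamp to [0, largest finite pattern]
    return sign | mag


for a, b in [(0x3C, 0x40), (0x3C, 0x3C), (0xBC, 0x40)]:
    p = approx_mul_bits(a, b)
    print(decode_e4m3(a), "*", decode_e4m3(b), "~", decode_e4m3(p))
```

The product is exact when at least one mantissa field is zero (for example 1.5 × 2.0) and underestimates otherwise (1.5 × 1.5 yields 2.0 instead of 2.25), which is the accuracy cost of replacing the multiplier with an adder.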
In high-performance computing, FP64 computations can be emulated using FP8 Tensor Cores via the Ozaki scheme, which splits operands into precisely re-scaled components to realize error-free transformations, runs FP8 GEMMs on those components, and reconstructs the result with higher-precision accumulation (Mukunoki, 1 Aug 2025, Uchino et al., 11 Mar 2026).
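The core idea of the splitting step can be shown on scalars. The sketch below cuts an FP64 value into slices that each carry at most 4 significant bits (the E4M3 significand width including the implicit bit), so that every slice-by-slice product is exact in a wider accumulator. It is a simplified illustration under stated assumptions: the function name `split_into_slices` is hypothetical, and real Ozaki-scheme implementations split whole matrix rows and columns against shared exponents rather than individual scalars.

```python
import math
from fractions import Fraction


def split_into_slices(x, bits=4, max_slices=14):
    """Split x into slices with at most `bits` significant bits each."""
    slices, r = [], x
    for _ in range(max_slices):
        if r == 0.0:
            break
        _, ex = math.frexp(r)        # |r| lies in [2**(ex-1), 2**ex)
        q = 2.0 ** (ex - bits)       # weight of the lowest bit that is kept
        s = math.trunc(r / q) * q    # keep only the top `bits` bits of r
        slices.append(s)
        r -= s                       # exact: only leading bits are removed
    return slices, r                 # r is the part not covered by the slices


x, y = 0.1, math.pi
xs, rx = split_into_slices(x)
ys, ry = split_into_slices(y)

# Every pairwise product needs at most 8 significant bits, so summing the
# products in exact arithmetic reproduces the exact product x * y.
exact = sum(Fraction(a) * Fraction(b) for a in xs for b in ys)
print(len(xs), len(ys), rx, ry)
print(exact == Fraction(x) * Fraction(y))
```

Fourteen 4-bit slices cover the 53-bit FP64 significand, so the remainder is zero here; using fewer slices trades accuracy for fewer low-precision GEMMs.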
4. Error Analysis, Precision/Range Tradeoffs, and Suitability
The primary mathematical tradeoff for FP8 is between dynamic range (exponent bits) and precision (mantissa bits) (Kuzmin et al., 2022, Micikevicius et al., 2022, Shen et al., 2023). E4M3 provides finer resolution within its range (one extra mantissa bit), which suits weights and activations with low variance, while E5M2 (and even higher-exponent formats) covers a wider dynamic range, which suits gradients, optimizer states, or outlier-plagued activations. For distributions with heavy tails (such as those in transformer activations), increasing exponent bits lowers MSE, so network architecture and data distribution should govern format selection (Kuzmin et al., 2022).
Empirical studies across 75 architectures show FP8 PTQ outperforms INT8 in quantization error and end-to-end accuracy, with E4M3 working best for NLP and E3M4 (3 exponent bits, 4 mantissa bits) for CV (Shen et al., 2023).
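The range-versus-precision tradeoff is easy to see on two sample values. The sketch below rounds the same inputs to simulated E4M3 and E5M2 grids; `round_to_fp8` is a hypothetical helper that emulates a saturating round-to-nearest-even cast on Python floats, and the sample values are chosen only for illustration.

```python
import math


def round_to_fp8(x, exp_bits, man_bits, max_normal):
    """Round x to the nearest FP8 grid point (ties to even), saturating."""
    if x == 0.0 or math.isnan(x):
        return x
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_normal)
    _, ex = math.frexp(a)
    e = max(ex - 1, 1 - bias)
    step = 2.0 ** (e - man_bits)
    return sign * round(a / step) * step


FORMATS = {"E4M3": (4, 3, 448.0), "E5M2": (5, 2, 57344.0)}

for x in (0.35, 1000.0):
    for name, (eb, mb, fmax) in FORMATS.items():
        q = round_to_fp8(x, eb, mb, fmax)
        print(f"{name}: {x} -> {q} (relative error {abs(q - x) / x:.1%})")
```

For 0.35 the extra mantissa bit of E4M3 gives the smaller error, while 1000.0 exceeds the E4M3 range and saturates at 448 but is representable to within a few percent in E5M2.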
5. Hardware Acceleration and Execution Characteristics
Recent accelerators (NVIDIA Hopper/H100, Intel Gaudi 2, AMD MI300A) natively support both E4M3 and E5M2 FP8 kernels (Jarmusch et al., 10 Feb 2026, Kim et al., 3 Feb 2025). These devices achieve up to 2× throughput/TFLOPS and 1.8× power efficiency compared to FP16, but actual gains are limited by occupancy, memory bandwidth, and kernel tiling strategies. On AMD MI300A, FP8 MFMA instructions with FP32 accumulation are available; maximum throughput is reached when large numbers (≥256) of active wavefronts are sustained (Jarmusch et al., 10 Feb 2026). For small batch sizes or "thin" GEMMs (common in decode-stage LLM inference), achievable FP8 performance is often less than hardware peak.
FP8 hardware, however, can be 50–180% less efficient than INT8 in terms of pure compute throughput, especially for inference; thus, INT8 remains preferable for edge-centric inference deployments (Baalen et al., 2023). On the other hand, FP8 offers critical advantages for training and high-dynamic-range workloads.
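The gap between peak and achieved throughput for thin GEMMs can be estimated with a simple arithmetic-intensity calculation. The sketch below is a back-of-the-envelope model, not taken from the cited measurements: it assumes each operand is read once, FP8 inputs occupy 1 byte, the output is written in a 2-byte format, and the matrix sizes are illustrative.

```python
def gemm_arithmetic_intensity(m, n, k, in_bytes=1, out_bytes=2):
    """FLOPs per byte moved for C[m, n] = A[m, k] @ B[k, n]."""
    flops = 2 * m * n * k                                   # one multiply + one add
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


# Decode-style GEMM: a single token row against a large weight matrix.
print(gemm_arithmetic_intensity(1, 4096, 4096))
# Prefill-style GEMM: many rows reuse the same weights.
print(gemm_arithmetic_intensity(4096, 4096, 4096))
```

An intensity of roughly 2 FLOPs per byte in the decode case means the kernel is bound by memory bandwidth long before FP8 compute peak matters, whereas the square case reuses each weight thousands of times.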
6. Post-training Quantization, Outlier Handling, and Hybrid Kernels
FP8 quantization workflows, as validated in FireQ, integrate outlier smoothing, channel-wise scaling, and RoPE-aware normalization to maintain accuracy under aggressive quantization—crucial for LLMs with rotary positional embeddings (2505.20839). Mixed-precision kernels, e.g., INT4 weights with FP8 activations, can further optimize bandwidth and performance.
Advanced methods, such as dynamic range expansion via companding (COAT) or mixed-precision quantization by per-tensor/activation regime, substantially reduce quantization-induced error while enabling end-to-end FP8 computation (including optimizer states and large layer activations) (Xi et al., 2024).
7. Applications, Conversion, and Limitations
FP8 arithmetic is now widespread in LLM training (matching or exceeding BF16 in speed and downstream accuracy at scale), federated learning (offering 2.9× communication savings over FP32), and scientific computing (enabling 8–53× acceleration for FP64 emulation) (Xi et al., 2024, Wang et al., 2024, Uchino et al., 11 Mar 2026). FP8-trained networks can be converted to INT8 for inference through post-training quantization without retraining, which avoids the 50–180% compute-efficiency penalty that dedicated FP8 hardware carries relative to INT8 (Baalen et al., 2023).
FP8 formats are not universally superior: for latency-sensitive inference, low-occupancy kernels, or edge inference with INT-only hardware, INT8 remains preferred. Moreover, relative errors (machine epsilon) are substantially higher than for FP16/BF16, which places inherent accuracy limits on FP8 for outlier-prone or numerically unstable models (Baalen et al., 2023, Micikevicius et al., 2022).
References:
- (Micikevicius et al., 2022) "FP8 Formats for Deep Learning"
- (Baalen et al., 2023) "FP8 versus INT8 for efficient deep learning inference"
- (Kuzmin et al., 2022) "FP8 Quantization: The Power of the Exponent"
- (Mukunoki, 1 Aug 2025) "DGEMM without FP64 Arithmetic -- using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme"
- (Xi et al., 2024) "COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training"
- (Wang et al., 4 Nov 2025) "FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error"
- (Jarmusch et al., 10 Feb 2026) "Execution-Centric Characterization of FP8 Matrix Cores, Asynchronous Execution, and Structured Sparsity on AMD MI300A"
- (Shen et al., 2023) "Efficient Post-training Quantization with FP8 Formats"
- (Kim et al., 3 Feb 2025) "An Inquiry into Datacenter TCO for LLM Inference with FP8"
- (Narayan et al., 9 Feb 2025) "μnit Scaling: Simple and Scalable FP8 LLM Training"
- (Lindberg et al., 2024) "On Approximate 8-bit Floating-Point Operations Using Integer Operations"
- (Tang, 8 Dec 2025) "The Native Spiking Microarchitecture: From Iontronic Primitives to Bit-Exact FP8 Arithmetic"
- (Hernández-Cano et al., 26 May 2025) "Towards Fully FP8 GEMM LLM Training at Scale"
- (2505.20839) "FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration"
- (Wang et al., 2024) "Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point"