FP8 Mixed-Precision Training

Updated 15 April 2026

FP8 mixed-precision training is a technique that uses 8-bit formats (E4M3 and E5M2) to compress tensors, reducing memory and energy usage while maintaining accuracy via higher-precision accumulation.
The workflow quantizes weights, activations, and gradients to FP8 for GEMM operations, accumulates the products in higher precision, and relies on loss scaling and stochastic rounding for stability.
Hardware innovations like the MiniFloat-NN ISA enable significant throughput and energy efficiency gains, supporting various architectures from CNNs to Transformers.

Floating-point 8-bit (FP8) mixed-precision training is a paradigm in deep learning that leverages ultra-low numerical precision to accelerate training and reduce memory and energy consumption, while maintaining accuracy comparable to conventional higher-precision (FP16, BF16, FP32) methods. By compressing tensors into 8-bit floating-point representations—most commonly E4M3 and E5M2—and executing key operations with mixed-precision accumulation and master states in higher precision, FP8 mixed-precision techniques deliver major hardware and algorithmic speedups. Robustness and stability are achieved via loss scaling, careful quantization/dequantization, and architectural or algorithmic innovations addressing outlier dynamics and numerical brittleness. The following sections synthesize major advances, principles, and implementation strategies for FP8 mixed-precision training as established in the research literature.

1. FP8 Formats and Numeric Properties

FP8 refers to floating-point number systems with a total bit width of 8, typically realized as E4M3 (1 sign, 4 exponent, 3 mantissa) and E5M2 (1 sign, 5 exponent, 2 mantissa) variants (Micikevicius et al., 2022). The two encodings differ in their trade-off between dynamic range and precision:

Format	Exponent bits	Mantissa bits	Exponent bias	Normal range	Max normal
E4M3	4	3	7	$2^{-6}$ to $448$	$448$
E5M2	5	2	15	$2^{-14}$ to $57344$	$57344$
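
The range entries in the table can be re-derived from the bit layouts. The sketch below is a minimal illustration, assuming the encodings described by Micikevicius et al. (2022): E5M2 reserves its all-ones exponent for infinities and NaNs in the IEEE style, whereas E4M3 reserves only the single all-ones exponent-and-mantissa pattern for NaN. The function name `fp8_limits` and its parameters are hypothetical and chosen only for this example.

```python
def fp8_limits(exp_bits, man_bits, bias, ieee_specials):
    """Return (min_subnormal, min_normal, max_normal) for a small float format.

    ieee_specials=True  : the all-ones exponent encodes Inf/NaN (assumed for E5M2).
    ieee_specials=False : only all-ones exponent AND mantissa is NaN (assumed for E4M3).
    """
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = min_normal * 2.0 ** (-man_bits)
    max_exp_field = 2 ** exp_bits - 1
    max_man_field = 2 ** man_bits - 1
    if ieee_specials:
        max_exp_field -= 1   # top exponent code is unavailable for finite values
    else:
        max_man_field -= 1   # top mantissa code at the top exponent is NaN
    significand = 1 + max_man_field / 2 ** man_bits
    max_normal = significand * 2.0 ** (max_exp_field - bias)
    return min_subnormal, min_normal, max_normal

print(fp8_limits(4, 3, 7, ieee_specials=False))   # (0.001953125, 0.015625, 448.0)
print(fp8_limits(5, 2, 15, ieee_specials=True))   # (1.52587890625e-05, 6.103515625e-05, 57344.0)
```

Both formats end up with a largest significand of 1.75, so the maxima are $1.75 \times 2^{8} = 448$ and $1.75 \times 2^{15} = 57344$.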

E4M3 is chosen for forward activations and weights, prioritizing resolution, while E5M2, with its expanded range and reduced precision, is preferred for backward gradients to mitigate under/overflow in back-propagation (Micikevicius et al., 2022, Fujii et al., 2024).

The representation and conversion entail clamping values into the FP8 dynamic range and applying either round-to-nearest-even or, to keep the rounding error unbiased in expectation, stochastic rounding. Subnormals extend the representable set toward zero, but magnitudes too small for even the smallest subnormal still round to zero, so good scaling is required to minimize representational error (Lee et al., 2024, Micikevicius et al., 2022).
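
The clamp-then-round conversion can be simulated in a few lines. The following is a sketch under stated assumptions, not a reference implementation: values are kept as ordinary Python floats that merely sit on the FP8 grid, out-of-range values saturate at the maximum normal, and the names `quantize_fp8`, `E4M3`, and `E5M2` are hypothetical.

```python
import math
import random

E4M3 = dict(man_bits=3, bias=7, max_normal=448.0)
E5M2 = dict(man_bits=2, bias=15, max_normal=57344.0)

def quantize_fp8(x, man_bits, bias, max_normal, rng=None):
    """Snap a float onto a simulated FP8 grid (result kept as a Python float).

    rng=None uses round-to-nearest-even; passing a random.Random instance
    switches to stochastic rounding.
    """
    if x == 0.0 or math.isnan(x):
        return x
    mag = min(abs(x), max_normal)  # clamp: saturate rather than overflow
    # frexp returns mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    # Subnormals all share the minimum exponent 1 - bias.
    exponent = max(math.frexp(mag)[1] - 1, 1 - bias)
    step = 2.0 ** (exponent - man_bits)  # spacing of representable values here
    scaled = mag / step
    if rng is None:
        q = round(scaled)  # Python's round() resolves ties to even
    else:
        low = math.floor(scaled)
        q = low + (rng.random() < scaled - low)  # round up with prob. = fraction
    return math.copysign(q * step, x)

print(quantize_fp8(0.3, **E4M3))         # 0.3125
print(quantize_fp8(1000.0, **E4M3))      # 448.0 (clamped)
print(quantize_fp8(2.0 ** -11, **E4M3))  # 0.0 (under half the smallest subnormal)

rng = random.Random(0)
samples = [quantize_fp8(2.0 ** -11, rng=rng, **E4M3) for _ in range(20000)]
print(sum(samples) / len(samples))       # close to 2**-11 on average
```

The last two lines illustrate the unbiasedness argument: deterministic rounding always maps $2^{-11}$ to zero in E4M3, while stochastic rounding returns either $0$ or $2^{-9}$ with probabilities that preserve the mean.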

2. FP8 Mixed-Precision Training Workflow and Key Operations

The canonical FP8 mixed-precision training recipe combines low-precision storage/computation with higher-precision accumulation and master-state retention (Mellempudi et al., 2019, Micikevicius et al., 2022, Wang et al., 2018):

Forward/Backward GEMMs:
- Input tensors (weights, activations) are quantized to FP8 before each matrix multiplication or convolution.
- Arithmetic is performed by upcasting inputs to FP16/BF16 and accumulating products in high-precision accumulators (FP16, BF16, or FP32).
- Outputs can be re-quantized to FP8 for further propagation (Wang et al., 2018, Micikevicius et al., 2022).
Weight Updates:
- Gradients are accumulated and optimizer moments are maintained in higher precision (FP16/BF16). Alternatively, moment statistics can be stored in compressed FP8 with scale/transform metadata, as in COAT (Xi et al., 2024).
- A master weight copy in FP16 or BF16 is kept, updated, and periodically re-quantized to FP8 for computation (Mellempudi et al., 2019, Xi et al., 2024).
Loss Scaling and Quantization Control:
- Loss scaling is essential: the computed loss is multiplied by a scale factor $L$ before back-propagation so that small gradients do not underflow the FP8 range, and the gradients are divided by $L$ again before the weight update. The factor is adjusted dynamically based on overflow/underflow statistics (Mellempudi et al., 2019).
- Stochastic rounding is recommended to counteract the bias of deterministic quantization and preserve the expectation of small updates—critical for gradient fidelity and generalization (Wang et al., 2018, Mellempudi et al., 2019).
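
The recipe above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions rather than any paper's implementation: Python floats stand in for the higher-precision accumulators and master weights, rounding is deterministic for brevity, and the helper names (`q8`, `e4m3`, `e5m2`, `train_step`), the synthetic data, the learning rate, and the fixed loss scale of 1024 are all hypothetical choices.

```python
import math
import random

def q8(x, man_bits, bias, max_normal):
    """Round-to-nearest-even onto a simulated FP8 grid, saturating at max_normal."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), max_normal)
    exponent = max(math.frexp(mag)[1] - 1, 1 - bias)
    step = 2.0 ** (exponent - man_bits)
    return math.copysign(round(mag / step) * step, x)

def e4m3(x):
    return q8(x, 3, 7, 448.0)      # forward tensors: weights, activations

def e5m2(x):
    return q8(x, 2, 15, 57344.0)   # backward tensors: error signals

def train_step(master_w, batch, lr, loss_scale):
    """One SGD step for y = w . x with FP8 operands and wide accumulation."""
    w8 = [e4m3(w) for w in master_w]   # re-quantize the master weights
    grad = [0.0] * len(master_w)       # higher-precision gradient accumulator
    loss = 0.0
    for x, target in batch:
        x8 = [e4m3(v) for v in x]
        pred = sum(wi * xi for wi, xi in zip(w8, x8))  # products summed in float
        err = pred - target
        loss += 0.5 * err * err
        g8 = e5m2(err * loss_scale / len(batch))       # scaled error, in E5M2
        for i, xi in enumerate(x8):
            grad[i] += g8 * xi
    # Undo the loss scale, then update the master copy in full precision.
    new_w = [w - lr * g / loss_scale for w, g in zip(master_w, grad)]
    return new_w, loss / len(batch)

rng = random.Random(0)
true_w = [0.5, -1.25, 2.0, 0.75]
data = []
for _ in range(64):
    x = [rng.uniform(-1, 1) for _ in true_w]
    data.append((x, sum(w * v for w, v in zip(true_w, x))))

w = [0.0] * len(true_w)
losses = []
for _ in range(200):
    w, loss = train_step(w, rng.sample(data, 16), lr=0.2, loss_scale=1024.0)
    losses.append(loss)
print(losses[0], sum(losses[-20:]) / 20)  # the loss drops well below its start
```

The small updates survive because they are applied to the master copy. If they were applied directly to `w8`, any update smaller than half the local FP8 spacing would be rounded away.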

3. Hardware and ISA Design for FP8 Mixed-Precision

Efficient FP8 training benefits from dedicated ISA and microarchitecture support. MiniFloat-NN extends the RISC-V ISA with three SIMD-style floating-point instructions supporting FP8/FP16 operations, with special attention to expanding dot-product accumulation and minimizing losses from floating-point non-associativity (Bertaccini et al., 2022). The ExSdotp (expanding sum-of-dot-product) unit is key, fusing two FP8 products and an FP16 accumulator in a single instruction:

$\mathit{ExSdotp}_{2w} = (a_w \times b_w) + (c_w \times d_w) + e_{2w}$

The SIMD wrapper statically unpacks and zero-pads operands to match accumulator widths, and the hardware saves approximately 30% of the area and critical path relative to cascaded expanding fused multiply-add units.
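
The numerical benefit of fusing can be illustrated in software. The sketch below is an assumption-laden emulation, not a model of the actual datapath: Python's float64 stands in for the unit's wide internal sum, the `struct` format `e` performs the rounding to FP16, and the "cascade" is taken to mean two chained expanding FMAs that each round to FP16. The function names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision (round-to-nearest-even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def exsdotp_fused(a, b, c, d, e):
    """a*b + c*d + e with a single rounding to the FP16 accumulator format."""
    return to_fp16(a * b + c * d + e)

def exfma_cascade(a, b, c, d, e):
    """Two chained expanding FMAs, rounding to FP16 after each one."""
    return to_fp16(c * d + to_fp16(a * b + e))

# The multiplicands are exactly representable in FP8; e is an FP16 value.
print(exsdotp_fused(1.0, 1.0, 1.0, 1.0, 2048.0))  # 2050.0
print(exfma_cascade(1.0, 1.0, 1.0, 1.0, 2048.0))  # 2048.0
```

At 2048 the FP16 spacing is 2, so each product of 1 is individually rounded away in the cascade, while the fused form adds both products before rounding once and keeps their contribution.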

A RISC-V-based 8-core cluster implementing MiniFloat-NN, fabricated in 12 nm FinFET, demonstrates $575\,\mathrm{GFLOPS/W}$ for FP8-to-FP16 GEMMs, indicating the extraordinary energy efficiency attainable via native FP8 support (Bertaccini et al., 2022).

4. Stability Mechanisms and Outlier Control

FP8 training is sensitive to numerical instability, especially in the presence of extreme activation outliers or small-magnitude gradients. Compensatory mechanisms include:

Enhanced Loss Scaling: Ensures that all gradients reach the FP8 normal range before quantization. Empirically, training ResNet-50 on ImageNet fails for $L=10^3$ but succeeds at $L=10^4$ (Mellempudi et al., 2019).
Stochastic Rounding: Replaces deterministic rounding, particularly for gradient and activation quantization, by rounding to adjacent representable values with probabilities dictated by proximity. This controls quantization noise and prevents silent gradient vanishing (Mellempudi et al., 2019, Wang et al., 2018).
Architectural and Regularization Strategies: Innovations such as TWEO penalize the 4th power of block-level activations, thereby suppressing activation outliers to match FP8 dynamic range and achieve BF16-equivalent convergence in large transformers (Liang et al., 28 Nov 2025). The μnit scaling technique enforces static, unit-variance scaling across layers, precluding the need for dynamic per-layer scale factors and providing exact numeric parity between training and inference at FP8 (Narayan et al., 9 Feb 2025).
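
The effect of loss scaling on small gradients, and one way to adjust the scale dynamically, can be sketched as follows. The assumptions are: a simulated E5M2 grid with deterministic rounding, any magnitude above the maximum normal treated as overflow (a simplification), and a halve-on-overflow, grow-after-clean-steps policy that is common in mixed-precision practice but is not claimed to be the schedule of Mellempudi et al. The names `to_e5m2` and `DynamicLossScaler`, and the scale values, are hypothetical.

```python
import math

def to_e5m2(x):
    """Round-to-nearest-even onto a simulated E5M2 grid; overflow becomes inf."""
    if x == 0.0:
        return 0.0
    mag = abs(x)
    if mag > 57344.0:
        return math.copysign(math.inf, x)  # simplification: no rounding margin
    exponent = max(math.frexp(mag)[1] - 1, -14)
    step = 2.0 ** (exponent - 2)
    return math.copysign(round(mag / step) * step, x)

class DynamicLossScaler:
    """Halve the scale on overflow; double it after `growth_interval` clean steps."""

    def __init__(self, scale=2.0 ** 16, growth_interval=100):
        self.scale = scale
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, scaled_grads):
        """Return True if the step can be applied, False if it must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2.0
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            self.scale *= 2.0
            self.clean_steps = 0
        return True

tiny_grad = 1e-6
print(to_e5m2(tiny_grad))              # 0.0: lost below the E5M2 subnormals
print(to_e5m2(tiny_grad * 1e4) / 1e4)  # 9.765625e-07: survives when scaled

scaler = DynamicLossScaler(scale=2.0 ** 16, growth_interval=2)
grads = [to_e5m2(g * scaler.scale) for g in (0.5, 2.0)]  # 2.0 * 65536 overflows
print(scaler.update(grads), scaler.scale)                # False 32768.0
```

A gradient of $10^{-6}$ lies below half of the smallest E5M2 subnormal ($2^{-16}$) and vanishes, whereas the scaled value lands in the normal range and is recovered to within a few percent after unscaling.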

5. FP8 Mixed-Precision Results: Accuracy, Efficiency, and Limitations

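Empirical studies report that, with proper mitigation, FP8 training matches or closely tracks full-precision baselines in computer vision, language, and machine translation tasks. In the results below, three of the four models slightly exceed their FP32 baseline, while the Transformer trails it by 0.6 BLEU: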

Model	FP32 baseline (accuracy % or BLEU)	FP8 mixed precision (accuracy % or BLEU)
ResNet-18 (ImageNet)	69.23	69.71
ResNet-50 (ImageNet)	75.47	75.70
GNMT BLEU	24.3	24.6
Transformer BLEU	23.6	23.0

Throughput improvements over FP16/FP32 baselines are reported, and specialized hardware such as the MiniFloat-NN RISC-V cluster achieves up to $575\,\mathrm{GFLOPS/W}$ on FP8-to-FP16 GEMM workloads (Bertaccini et al., 2022). FP8 also yields substantial memory and bandwidth savings, since weights and activations need only 8 bits per element, which allows larger models or larger batch sizes within a fixed memory budget (Xi et al., 2024).
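
The storage side of this claim is simple arithmetic, sketched below. The parameter count is a hypothetical figure chosen only for illustration, and the calculation ignores scale-factor metadata as well as master weights and optimizer state, which stay in higher precision.

```python
BYTES_PER_ELEMENT = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}

def tensor_gib(num_elements, fmt):
    """Storage for num_elements values in the given format, in GiB."""
    return num_elements * BYTES_PER_ELEMENT[fmt] / 2 ** 30

params = 7_000_000_000  # hypothetical model size
for fmt in ("FP32", "BF16", "FP8"):
    print(f"{fmt}: {tensor_gib(params, fmt):.1f} GiB")
# FP32: 26.1 GiB
# BF16: 13.0 GiB
# FP8: 6.5 GiB
```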

However, correct operation depends on maintaining master weights in higher precision, applying robust loss scaling, and deploying stochastic rounding and/or outlier regularization. Inadequate scaling or precision for accumulators and master copies leads to degradation, instability, or divergence, particularly in large models and deep networks (Mellempudi et al., 2019, Liang et al., 28 Nov 2025).

6. Scope and Practical Application

FP8 mixed-precision is now broadly validated across diverse neural architectures (CNNs, RNNs, Transformers) and scales (from CIFAR-10 to 175B-parameter LLMs), in both training and inference. Modern accelerator ISAs (e.g., NVIDIA Hopper, ARM, Intel) and open-source hardware blocks (e.g., MiniFloat-NN) are incorporating native FP8 support (Micikevicius et al., 2022, Bertaccini et al., 2022).

Typical usage prescribes:

- Quantizing all activation, weight, and error tensors to FP8 except in the most numerically sensitive modules (first/last layers, softmax/logits).
- Implementing all GEMMs and convolutions with upcast FP8 inputs and an FP16/FP32 accumulator.
- Using stochastic rounding and loss scaling throughout.
- Preserving master weights/optimizer states in at least FP16, updating in FP32.
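
These prescriptions can be collected into a small configuration object. The sketch below is purely illustrative: `FP8Recipe`, `format_for`, the layer kinds, and the choice of BF16 as the fallback format for sensitive modules are hypothetical and do not correspond to any particular framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FP8Recipe:
    """Hypothetical container for the settings listed above."""
    forward_format: str = "E4M3"     # activations and weights
    backward_format: str = "E5M2"    # error tensors / gradients
    accumulator: str = "FP32"        # GEMM and convolution accumulation
    master_weights: str = "FP16"     # stored in at least FP16
    update_precision: str = "FP32"   # arithmetic of the weight update
    stochastic_rounding: bool = True
    loss_scaling: bool = True
    fallback_format: str = "BF16"    # assumed format for sensitive modules
    sensitive_kinds: tuple = ("softmax", "logits")

def format_for(recipe, index, num_layers, kind):
    """Return the forward compute format for layer `index` of `num_layers`."""
    is_edge = index == 0 or index == num_layers - 1   # first or last layer
    if is_edge or kind in recipe.sensitive_kinds:
        return recipe.fallback_format
    return recipe.forward_format

recipe = FP8Recipe()
kinds = ["conv", "conv", "linear", "softmax", "linear", "logits"]
plan = [format_for(recipe, i, len(kinds), k) for i, k in enumerate(kinds)]
print(plan)  # ['BF16', 'E4M3', 'E4M3', 'BF16', 'E4M3', 'BF16']
```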

Given robust software and hardware support, this recipe makes the most of native FP8 hardware. For broader deployment, configurable ISA extensions, hardware-unified FP8 units, and co-designed regularization schemes (e.g., TWEO) further expand the reliability and utility of mixed-precision FP8 platforms (Liang et al., 28 Nov 2025, Mellempudi et al., 2019, Bertaccini et al., 2022).

References