Float8 Training for Efficient Deep Learning

Updated 19 March 2026

Float8 training is a technique that uses 8-bit floating-point formats (e.g., FP8 variants) to reduce compute and memory use while retaining competitive model accuracy.
It utilizes precision assignment, quantization/dequantization, and dynamic promotion to manage numerical instability and prevent overflow during neural network training.
Empirical studies show that Float8 training achieves accuracy comparable to FP16/FP32 methods across vision, language, and federated tasks with significant resource savings.

Float8 training refers to the use of 8-bit floating-point numerical formats in the training of deep neural networks. These formats, typically denoted by abbreviations like FP8 or Float8, offer a trade-off between reduced compute/memory resource consumption and potential numerical instability or accuracy loss relative to standard 16- or 32-bit floating-point representations. Recent research demonstrates that with appropriate algorithmic adaptations, Float8 training achieves substantial efficiency gains without significant degradation in convergence or final model quality across a wide array of domains, including large-scale vision, language, federated learning, and extreme classification tasks.

1. Definitions and Float8 Numerical Formats

Modern Float8 formats are IEEE-inspired binary interchange representations. The two most widely used are E4M3 (1 sign, 4 exponent, 3 mantissa bits) and E5M2 (1 sign, 5 exponent, 2 mantissa bits) (Micikevicius et al., 2022). These formats encode real numbers as: $x = (-1)^s \cdot (1 + M/2^m) \cdot 2^{E-b}$ for "normal" numbers (where $s$ is the sign bit, $M$ is the integer value of the mantissa field, $m$ is the number of mantissa bits, $E$ is the value of the exponent field, and $b$ is the exponent bias), and as

$x = (-1)^s \cdot (M/2^m) \cdot 2^{1-b}$

for subnormals (exponent field equal to zero). E4M3 extends its dynamic range by reusing most of the all-ones exponent field for ordinary values: it has no infinities and reserves only a single mantissa pattern for NaN. E5M2 follows IEEE-754 conventions, including special encodings for $\infty$ and NaN.
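The two encoding formulas above can be checked with a small decoder. The following is a minimal sketch in plain Python, not a reference implementation: the names `Fp8Format` and `decode_fp8` are hypothetical, and the special-value handling assumes the convention described by Micikevicius et al. (2022), in which E4M3 reserves only the all-ones exponent with all-ones mantissa for NaN.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Fp8Format:
    exp_bits: int
    man_bits: int
    bias: int
    ieee_specials: bool  # True: all-ones exponent encodes inf/NaN (E5M2 style)


# Assumed parameters, matching the formats described in the text.
E4M3 = Fp8Format(exp_bits=4, man_bits=3, bias=7, ieee_specials=False)
E5M2 = Fp8Format(exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def decode_fp8(byte: int, fmt: Fp8Format) -> float:
    """Decode one 8-bit pattern (0..255) into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_all_ones = (1 << fmt.exp_bits) - 1
    man_all_ones = (1 << fmt.man_bits) - 1
    exp_field = (byte >> fmt.man_bits) & exp_all_ones
    man_field = byte & man_all_ones

    if exp_field == exp_all_ones:
        if fmt.ieee_specials:
            return sign * math.inf if man_field == 0 else math.nan
        if man_field == man_all_ones:
            return math.nan  # E4M3: the only NaN pattern, no infinities
        # Otherwise E4M3 treats the all-ones exponent as a normal binade.

    frac = man_field / (1 << fmt.man_bits)  # M / 2^m
    if exp_field == 0:
        return sign * frac * 2.0 ** (1 - fmt.bias)           # subnormal
    return sign * (1.0 + frac) * 2.0 ** (exp_field - fmt.bias)  # normal


largest_e4m3 = decode_fp8(0x7E, E4M3)  # 448.0
largest_e5m2 = decode_fp8(0x7B, E5M2)  # 57344.0
```

Under these assumptions the largest finite E4M3 value is 448 and the largest finite E5M2 value is 57344, which illustrates why the extra exponent bit is usually spent on gradients.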

A selection of Float8 formats used in the literature:

Name	Exponent	Mantissa	Bias	Dynamic Range (min normal .. max)	Use
E4M3	4	3	7	± $1.56\!\times\!10^{-2}..±448$	Activations/weights (Micikevicius et al., 2022, Liang et al., 2024)
E5M2	5	2	15	± $6.10\!\times\!10^{-5}..±5.73\!\times\!10^4$	Gradients (Micikevicius et al., 2022, Liang et al., 2024)
HiFloat8	Tapered-precision	(1–3)	variable	38 binades	Unified forward/backward (Luo et al., 2024)

Formats are chosen per tensor type, with wider exponent (E5M2) for gradients and narrower (E4M3) for forward-pass activations and weights. Some approaches (HiFloat8) use a single tapered-precision Float8 format, assigning more mantissa bits to lower exponents for a superior trade-off between range and precision (Luo et al., 2024).

2. Core Algorithmic Techniques

Float8 training requires multiple modifications to conventional mixed-precision training pipelines (Mellempudi et al., 2019, Lee et al., 2023, Micikevicius et al., 2022):

Precision Assignment: Tensors in the computational graph are grouped (e.g., between GEMM operators), sorted by size, and demoted from high (FP16/FP32) to low (FP8) precision to meet a target memory ratio. During training, if the overflow rate in any group exceeds a threshold (typically 1%), all tensors in that group are promoted back to high precision (Lee et al., 2023).
Quantization/Dequantization: Input data is quantized to FP8 via scaling and rounding (nearest or stochastic), optionally using per-tensor or per-row scaling factors. After computation, results are dequantized for further processing. In large-scale systems (e.g., TorchTitan), dynamic scale estimation occurs every forward/backward pass; in μnit Scaling, a static, theoretically derived $1/\sqrt{\mathrm{fan\_in}}$ scale is used for all hidden linear layers (Narayan et al., 9 Feb 2025).
Loss/Gradient Scaling: To address the small dynamic range/subnormal coverage of FP8, the backward loss or gradients are scaled to prevent underflow. Static or dynamic strategies are adopted, often with an overflow-detection mechanism that halves the loss scale and skips the step on overflow (Mellempudi et al., 2019, Luo et al., 2024, Noune et al., 2022).
Rounding Mode: Stochastic rounding has been found critical to avoid systematic quantization bias, particularly for deeper networks or massive classification heads (Mellempudi et al., 2019, Zhang et al., 13 Oct 2025). Round-to-nearest is used where its bias can be tolerated.
Dynamic Promotion: To guarantee convergence and avoid numerical instability/divergence, tensors exhibiting high overflow rates in FP8 are dynamically promoted to higher precision for the rest of training (Lee et al., 2023).
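Two of the control loops in this list, loss-scale backoff on overflow and overflow-triggered promotion, can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation from any cited paper: the class names are hypothetical, the regrowth of the scale after a run of clean steps is a common convention that the text does not specify, and the 1% threshold is the typical value quoted above.

```python
import math


class DynamicLossScaler:
    """Halve the loss scale and skip the step whenever gradients overflow."""

    def __init__(self, init_scale=2.0 ** 15, backoff=0.5,
                 growth=2.0, growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff                  # 0.5 halves the scale
        self.growth = growth                    # assumption: regrow when stable
        self.growth_interval = growth_interval  # assumption: clean steps needed
        self._clean_steps = 0

    def unscale_and_check(self, scaled_grads):
        """Return unscaled gradients, or None if the step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            self.scale *= self.backoff
            self._clean_steps = 0
            return None
        grads = [g / self.scale for g in scaled_grads]
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= self.growth
            self._clean_steps = 0
        return grads


class GroupPrecision:
    """Tracks one tensor group and promotes it once it overflows too often."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold  # 1% overflow rate, as quoted in the text
        self.dtype = "fp8"

    def observe(self, num_overflowed, num_elements):
        """Record one step's overflow count; promotion is never undone."""
        rate = num_overflowed / num_elements
        if self.dtype == "fp8" and rate > self.threshold:
            self.dtype = "fp16"
        return self.dtype
```

In a training loop, a `None` result from `unscale_and_check` would mean the optimizer step is skipped for that iteration, while `GroupPrecision.observe` would be called once per group with the overflow count measured during quantization.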

3. System Integration and Distributed Support

From practical deployment in frameworks to large-scale distributed systems:

Framework Support: Float8 is being integrated as a first-class dtype (e.g., torchao.float8 in PyTorch), with hardware-level support on NVIDIA H100 GPUs (FP8 Tensor Cores) and Graphcore IPUs (Liang et al., 2024, Balança et al., 2024).
Scaling Policies: Both static (fixed scale per tensor) and dynamic (estimated per pass as $\max(|x|)/q_{\max}$ ) scaling policies are implemented (Liang et al., 2024).
Hardware-Software Co-design: Kernel fusion, pipeline demarcation (e.g., via torch.compile), and extension of memory subsystems (SymmetricMemory) are used to maximize bandwidth and overlap FP8 computation/communication (Liang et al., 2024).
Distributed (Federated) Learning: Float8 enables low-overhead on-device training and communication. The FP8FedAvg-UQ approach combines unbiased stochastic quantization, per-tensor clipping, and a global FP32 server model. Empirical results show up to 6× communication savings with negligible accuracy loss (Wang et al., 2024).
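The dynamic scaling policy, with the scale estimated as $\max(|x|)/q_{\max}$, can be illustrated with a software simulation of E4M3 rounding. The sketch below is plain Python rather than a framework API: `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical helper names, the values stay as ordinary Python floats snapped onto the E4M3 grid instead of packed 8-bit codes, and out-of-range values are assumed to saturate at the largest finite value (448).

```python
import math

E4M3_MAX = 448.0     # largest finite E4M3 value (q_max)
MAN_BITS = 3
MIN_NORMAL_EXP = -6  # smallest normal exponent, 1 - bias


def round_to_e4m3(x: float) -> float:
    """Round to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)               # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, MIN_NORMAL_EXP)     # clamp: subnormals share one spacing
    step = 2.0 ** (exp - MAN_BITS)       # grid spacing in this binade
    q = round(mag / step) * step         # Python's round() is ties-to-even
    return math.copysign(min(q, E4M3_MAX), x)


def quantize_per_tensor(xs):
    """Dynamic per-tensor scaling: scale = max|x| / q_max."""
    amax = max(abs(x) for x in xs)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(x / scale) for x in xs], scale


def dequantize(qs, scale):
    return [q * scale for q in qs]


xs = [0.001, -0.5, 3.0, 1000.0]
qs, scale = quantize_per_tensor(xs)
restored = dequantize(qs, scale)
```

In this example the largest element maps to 448 and mid-range values come back with a relative error of a few percent, while the smallest element (0.001) underflows to zero because one outlier sets the scale for the whole tensor. That failure mode is one motivation for per-row scaling.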

4. Empirical Results and Model Performance

Float8 training, when combined with the above algorithmic and system strategies, achieves high-fidelity convergence across vision, language, and federated learning workloads:

Image Classification (CNNs): On ImageNet, ResNet-50 trained with E5M2-based FP8 plus loss scaling matches or slightly exceeds FP32 accuracy (75.47% FP32 vs 75.70% FP8/e5m2) (Mellempudi et al., 2019). With S2FP8 (dynamic shift+squeeze), accuracy degradation remains below 1% even without tuned loss scaling (Cambier et al., 2020).
LLMs: Llama 3.1 models (8–405B parameters) trained using Float8 via TorchTitan retain perplexity within 0.1 versus BF16 across 15T tokens, with 50–65% throughput speedups (Liang et al., 2024). μnit Scaling eliminates the need for hyperparameter rescaling/tuning and achieves convergence and accuracy on par with mixed-precision BF16, up to model sizes of 13B (Narayan et al., 9 Feb 2025).
Extreme Multilabel and Output-Space Models: Classifier heads with 3M–18M labels trained entirely in Float8, with BF16-based Kahan summation and stochastic rounding, yield >6× memory reduction and precision matching or closely following FP32 baselines (Zhang et al., 13 Oct 2025).
Federated Learning: FP8FedAvg-UQ reduces client-server communication by ≥2.9×, with test accuracies within 1% of full-precision baselines across i.i.d. and non-i.i.d. splits (Wang et al., 2024).

5. Advanced Techniques and Alternatives

Scale Propagation (Scalify): Scale-propagation approaches introduce explicit scale tracking and propagation through the computation graph, minimizing the need for frequent rescaling and reducing overhead. Only a handful of strategic dynamic rescale points are used (e.g., after LayerNorm backward) (Balança et al., 2024).
Tapered and Adaptive Formats: HiFloat8 implements a single format with mantissa count tapered to the exponent binade, bridging the gap between precision and dynamic range and achieving near FP16 performance in both forward and backward passes (Luo et al., 2024).
Hardware Support: Open RISC-V clusters and ExSdotp units enable efficient, associative 8/16-bit operations for training, achieving up to 575 GFLOPS/W for FP8→FP16 GEMMs (Bertaccini et al., 2022). Custom hardware architectures further reduce energy and area per MAC.
Full Low-Precision Heads: In the extreme classification regime, classification-head memory and compute can dwarf the encoder. Pure FP8 heads (plus Kahan and stochastic rounding) match accuracy at a fraction of the memory overhead, with custom kernels fusing backward gradient calculation and updates in SRAM (Zhang et al., 13 Oct 2025).
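The role of Kahan summation in low-precision weight updates can be shown with a small simulation. This sketch is an assumption-laden illustration and not the kernel from Zhang et al.: it emulates BF16 by rounding Python floats (finite inputs only), keeps both the weight and the compensation term in BF16, and applies the same small update repeatedly. The function names are hypothetical.

```python
import struct


def round_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round half to even on the top 16 bits
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def naive_updates(w: float, update: float, steps: int) -> float:
    """Plain accumulation: each sum is rounded straight back to BF16."""
    for _ in range(steps):
        w = round_bf16(w + update)
    return w


def kahan_updates(w: float, update: float, steps: int) -> float:
    """Kahan-compensated accumulation with a BF16 compensation buffer."""
    comp = 0.0
    for _ in range(steps):
        y = round_bf16(update - comp)             # re-inject the lost part
        t = round_bf16(w + y)
        comp = round_bf16(round_bf16(t - w) - y)  # what rounding just discarded
        w = t
    return w


update = round_bf16(0.001)
w_naive = naive_updates(1.0, update, 1000)  # stays at 1.0: update < half a ulp
w_kahan = kahan_updates(1.0, update, 1000)  # close to the ideal value of 2.0
```

Near 1.0 the BF16 spacing is $2^{-7}$, so an update of about 0.001 is rounded away every time in the naive loop. The compensation buffer accumulates those discarded fractions until they are large enough to move the weight.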

6. Limitations, Best Practices, and Open Problems

While Float8 training is now viable for a broad array of architectures, several issues remain under investigation:

Task/Model Sensitivity: Uniform all-FP8 precision often results in divergence, especially in small or "hard" models. Layer-specific or groupwise adaptation with live fallback to higher precision is necessary for robust training (Lee et al., 2023).
Scaling/Overflow Management: Per-tensor scale selection is critical. Too broad or infrequent scaling leads to saturation or underutilization; frequent per-matmul scaling incurs computational overhead. Approaches like μnit Scaling (static) and Scale Propagation (minimal dynamic correction) attempt to balance this tradeoff (Narayan et al., 9 Feb 2025, Balança et al., 2024).
Hyperparameter Tuning and Transfer: Static scaling protocols such as μnit Scaling remove the need for bespoke tuning across model sizes—hyperparameters like learning rate and weight decay can be transferred directly with simple formulas involving model width (Narayan et al., 9 Feb 2025).
Accumulation and Optimizer State: Most schemes accumulate in FP32 or BF16 and retain master weights in at least FP16. Gradient and optimizer state compression to FP8 is possible but requires explicit management of scale and, often, the use of higher exponent ranges (Balança et al., 2024).
Rounding and Regularization: Stochastic rounding is empirically required for unbiased, stable training in deep or wide models, as round-to-nearest introduces subtle bias (Mellempudi et al., 2019, Zhang et al., 13 Oct 2025).
Hardware Ecosystem: Broad adoption depends on first-class support in major ML frameworks, continued hardware development (widespread FP8 Tensor Core deployment), and continued software-hardware co-optimization (Liang et al., 2024, Bertaccini et al., 2022).
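The rounding-bias point above can be made concrete with a toy experiment: repeatedly adding an update that is much smaller than the grid spacing. The sketch below is a simplified illustration with hypothetical function names. It uses a uniform grid with step 0.125 (the E4M3 spacing between 1 and 2), whereas real FP8 spacing grows with magnitude, and a seeded `random.Random` stands in for a hardware random source.

```python
import math
import random

STEP = 0.125  # assumed uniform grid; equals the E4M3 spacing in [1, 2)


def round_nearest(x: float, step: float = STEP) -> float:
    """Deterministic round-to-nearest onto the grid."""
    return round(x / step) * step


def round_stochastic(x: float, rng: random.Random, step: float = STEP) -> float:
    """Round up with probability equal to the fractional position in the cell."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def train(round_fn, steps: int = 2000, update: float = 0.001, w: float = 1.0) -> float:
    """Apply the same tiny update repeatedly, rounding after every step."""
    for _ in range(steps):
        w = round_fn(w + update)
    return w


rng = random.Random(0)
w_nearest = train(round_nearest)                          # stuck at 1.0
w_stochastic = train(lambda x: round_stochastic(x, rng))  # drifts toward 3.0
```

With round-to-nearest every update is discarded and the weight never moves. With stochastic rounding the expected value after 2000 steps is 1 + 2000 × 0.001 = 3.0, at the cost of added variance in each individual run.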

7. Comparative Summary of Major Float8 Training Algorithms

Method/Framework	Scaling Strategy	Promotion/Overflow	Empirical Highlights	Reference
Mixed-Precision Assignment	Largest-first demotion, overflow-triggered promotion	Tensor-wise live FP8→FP16 fallback	>2× mem savings, full convergence on CIFAR/ImageNet	(Lee et al., 2023)
Static Per-Layer μS	Static unit-variance, no dynamic rescale	None needed	Full LLM FP8 training (up to 13B), matches BF16	(Narayan et al., 9 Feb 2025)
Scale Propagation	Persistent scale over graph, rare dynamic rescale	Only at LayerNorm/critical points	FP8 loss curves overlay FP16; 30% memory/35% speedup	(Balança et al., 2024)
S2FP8/Shifted-Squeezed	Per-tensor shift/squeeze via statistics	No loss scaling required	<1% accuracy drop without tuning/residual high-prec	(Cambier et al., 2020)
Stochastic Rounding	Optional on all quantizations	Required for very deep/wide nets	Unbiased rounding; removes systematic round-to-nearest bias	(Mellempudi et al., 2019, Zhang et al., 13 Oct 2025)
ELMO FP8 XMC Head	Pure FP8 head, BF16+Kahan encoder	Stochastic rounding in kernel	6×–13× memory reduction, equivalent precision	(Zhang et al., 13 Oct 2025)
TorchTitan	Dynamic scale, overflow clamping	Clamped, optional reduction	50–65% speedup, perplexity within 0.1 of BF16 on Llama-3.1	(Liang et al., 2024)

Taken collectively, the modern Float8 training ecosystem incorporates robust numerical schemes, precise tensorwise adaptation, and direct hardware and software integration, enabling full-precision-level accuracy at a fraction of the computational and bandwidth cost previously required. These advances have made Float8 a credible and practical datatype for neural network training at scale.