FP8 Precision in Deep Learning
- FP8 is a reduced-precision numerical format with two main types (E4M3 and E5M2) that enable efficient deep neural network training and inference.
- FP8 quantization employs per-tensor and per-channel scaling, rounding, and clipping to map tensors to 8-bit representation while managing dynamic range.
- Stabilization techniques like Smooth-SwiGLU and dynamic range expansion are critical for mitigating numerical instabilities and preserving model accuracy.
Floating-point 8-bit (FP8) precision is a reduced-precision numerical format now widely adopted for training and inference of deep neural networks, particularly LLMs. FP8 is defined by an 8-bit binary interchange format, most commonly E4M3 (1 sign bit, 4 exponent bits, 3 mantissa bits) and E5M2 (1 sign bit, 5 exponent bits, 2 mantissa bits), approximating IEEE-style floating-point conventions while compressing both dynamic range and arithmetic resolution relative to FP16, BF16, or FP32. The move to FP8 is driven by hardware support on recent AI accelerators, promising up to ∼2× matrix-multiply throughput (∼34%–40% end-to-end speedups in practice) and ∼30% memory savings with minimal impact on model quality, provided critical stability measures are enforced (Fishman et al., 19 Sep 2024, Liang et al., 28 Nov 2025).
1. FP8 Numeric Formats and Value Representation
FP8 employs two principal formats, specified by the bit allocation for exponent (e) and mantissa (m):
| Format | Sign bits | Exponent bits | Mantissa bits | Bias | Max normal | Min normal | ε (machine) |
|---|---|---|---|---|---|---|---|
| E4M3 | 1 | 4 | 3 | 7 | 448 | 1.56e–2 | 0.125 |
| E5M2 | 1 | 5 | 2 | 15 | 57,344 | 6.10e–5 | 0.25 |
A normalized FP8 value is represented as

$$x = (-1)^{s} \cdot 2^{\,e - \mathrm{bias}} \cdot \left(1 + \frac{m}{2^{M}}\right), \qquad \mathrm{bias} = 2^{E-1} - 1,$$

where s is the sign bit, e is the unsigned encoded exponent, m is the integer mantissa, M is the mantissa width, E is the exponent width, and the bias is 7 for E4M3 and 15 for E5M2 (Fishman et al., 19 Sep 2024, Micikevicius et al., 2022). E4M3 prioritizes precision (machine epsilon 0.125) with a moderate dynamic range; E5M2 offers a larger dynamic range but coarser granularity. E4M3 does not encode distinct Infinities and reclaims all but one NaN mantissa pattern, while E5M2 follows IEEE-754 conventions for special values.
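As a worked check of the table above, the formula can be applied directly to raw 8-bit encodings. The following is a minimal Python sketch; the helper name and the chosen bit patterns are illustrative:

```python
def decode_fp8(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode an 8-bit pattern using the formula above (NaN/Inf handling omitted)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    bias = (1 << (exp_bits - 1)) - 1
    e = (bits >> man_bits) & ((1 << exp_bits) - 1)
    m = bits & ((1 << man_bits) - 1)
    if e == 0:  # subnormal: no implicit leading 1, exponent fixed at 1 - bias
        return sign * 2.0 ** (1 - bias) * (m / 2 ** man_bits)
    return sign * 2.0 ** (e - bias) * (1 + m / 2 ** man_bits)

print(decode_fp8(0x7E, 4, 3))  # E4M3: 0b0_1111_110 -> 1.75 * 2^8  = 448.0 (max finite; 0x7F is NaN)
print(decode_fp8(0x7B, 5, 2))  # E5M2: 0b0_11110_11 -> 1.75 * 2^15 = 57344.0 (max finite)
print(decode_fp8(0x08, 4, 3))  # E4M3: 0b0_0001_000 -> 2^-6 ≈ 1.56e-2 (min normal)
```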
Relative to FP16 and BF16, FP8's far narrower dynamic range (maximum finite values of 448 for E4M3 and 57,344 for E5M2, versus ≈6.5×10⁴ for FP16 and ≈3.4×10³⁸ for BF16) and minimal mantissa width introduce significant quantization error and require dedicated stabilization for gradient and activation management (Lee et al., 29 May 2024).
2. Scaling and Quantization Methodologies
FP8 quantization maps floating-point tensors to 8-bit numerical representation via per-tensor or per-channel scaling, rounding, clipping, and dequantization. The canonical procedure is as follows (Fishman et al., 19 Sep 2024, Peng et al., 2023):
- Determine a scale: $s = q_{\max} / \max(|x|)$, where $q_{\max}$ is the maximum finite FP8 value.
- Quantize: $\hat{x} = \mathrm{cast}_{\mathrm{FP8}}\!\left(\mathrm{clip}(x \cdot s,\, -q_{\max},\, q_{\max})\right)$, with dequantization via $x \approx \hat{x}/s$.
- For critical layers prone to outlier activations (e.g., the last MLP input), per-channel scaling is employed: $s_{c} = q_{\max} / \max(|x_{:,c}|)$ for each channel $c$.
Delayed scaling, where the per-tensor scale is set from the previous iteration's maximum absolute value, maximizes hardware throughput but may defer outlier exposure; per-channel scaling mitigates catastrophic clipping by localizing scaling factors to high-variance subspaces (Fishman et al., 19 Sep 2024, Peng et al., 2023). Values exceeding the representable range are clipped (saturated) during the cast.
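As an illustration of this procedure, the following is a minimal NumPy sketch; the function names are hypothetical, the E4M3 maximum of 448 is used as $q_{\max}$, and the hardware cast's mantissa rounding is not simulated:

```python
import numpy as np

E4M3_MAX = 448.0  # maximum finite magnitude of E4M3, used as q_max

def fp8_scale_and_clip(x, amax=None, q_max=E4M3_MAX):
    """Per-tensor scale + clip step of FP8 quantization. With delayed scaling,
    `amax` is taken from a previous iteration's running maximum rather than
    from the current tensor."""
    if amax is None:
        amax = float(np.abs(x).max())
    scale = q_max / max(amax, 1e-12)         # map the observed range onto the FP8 range
    x_q = np.clip(x * scale, -q_max, q_max)  # out-of-range values saturate (clip)
    return x_q, scale                        # dequantize later via x ≈ x_q / scale

def fp8_scale_and_clip_per_channel(x, q_max=E4M3_MAX):
    """Per-channel variant for a 2-D activation [tokens, channels]: each channel
    gets its own scale, so one outlier channel cannot crush the resolution of the rest."""
    amax = np.abs(x).max(axis=0, keepdims=True)
    scale = q_max / np.maximum(amax, 1e-12)
    return np.clip(x * scale, -q_max, q_max), scale

# A tensor with one outlier channel: per-tensor scaling shrinks everything,
# per-channel scaling confines the small scale to the offending channel.
x = np.random.randn(16, 4).astype(np.float32)
x[:, 3] *= 1000.0
_, s_tensor = fp8_scale_and_clip(x)
_, s_channel = fp8_scale_and_clip_per_channel(x)
```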
Empirical studies demonstrate the importance of matching quantization granularity and scaling calibration to workload, with NLP models favoring E4M3 and per-channel scaling, and vision models occasionally benefitting from E3M4 (finer precision, lower range) (Shen et al., 2023).
3. Instabilities, Outlier Mechanisms, and Stabilization Techniques
Long-horizon LLM training in FP8 reveals late-stage divergent behaviors not observable in shorter or higher-precision regimes. The primary instability is outlier amplification initiated in nonlinear activations, specifically SwiGLU. The mechanism is a quadratic outlier produced when the two linear-branch parameter vectors in SwiGLU align ($\mathbf{w}_1 \approx \mathbf{w}_2$):

$$\mathrm{SwiGLU}(x) = (\mathbf{w}_1^{\top} x)\cdot \mathrm{Swish}(\mathbf{w}_2^{\top} x) \;\approx\; (\mathbf{w}_1^{\top} x)^{2} \quad \text{when } \mathbf{w}_1 \approx \mathbf{w}_2 \text{ and } |\mathbf{w}_2^{\top} x| \text{ is large.}$$
Such alignment is an emergent property under regularization and protracted gradient descent, leading to correlated parameter blowup and sporadic, severe activation spikes exceeding the FP8 representable range (Fishman et al., 19 Sep 2024, Liang et al., 28 Nov 2025).
Remediation—Smooth-SwiGLU:
Smooth-SwiGLU inserts a per-channel scale $s_c$ into the MLP block, tightly bounding the product $(\mathbf{w}_1^{\top} x)\cdot\mathrm{Swish}(\mathbf{w}_2^{\top} x)$ and quantizing only after this controlled scaling. At inference, $s_c$ can be folded into adjacent weight layers, incurring zero runtime cost. This eliminates instabilities up to and beyond 2T tokens and keeps loss curves and downstream accuracy on par with BF16 (Fishman et al., 19 Sep 2024).
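A conceptual NumPy sketch of the idea follows; the weight names, shapes, and exact placement of the scales are assumptions for illustration and do not reproduce the reference implementation:

```python
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))

def smooth_swiglu_mlp(x, W_gate, W_up, W_down, q_max=448.0):
    """SwiGLU MLP with a per-channel smoothing scale s applied before the FP8 cast
    of the hidden product and undone afterwards; at inference, 1/s can be folded
    into the rows of W_down (W_down / s.T), so the rescale has zero runtime cost."""
    h = swish(x @ W_gate) * (x @ W_up)     # outlier-prone quadratic product when W_gate ≈ W_up
    s = q_max / np.maximum(np.abs(h).max(axis=0, keepdims=True), 1e-12)
    h_fp8 = np.clip(h * s, -q_max, q_max)  # stand-in for the actual FP8 cast
    return (h_fp8 / s) @ W_down
```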
Additional stabilization approaches include dynamic range expansion for optimizer states (Xi et al., 25 Oct 2024), variance-preserving architectural modifications (e.g., fixed skip-connected residuals, "unit scaling") (Narayan et al., 9 Feb 2025), and data-independent outlier suppression via loss penalties (TWEO) (Liang et al., 28 Nov 2025).
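As one plausible form of such a loss penalty, the sketch below penalizes activation magnitudes beyond a threshold; the threshold τ, weight λ, and functional form are assumptions and do not reproduce the TWEO formulation:

```python
import numpy as np

def activation_outlier_penalty(activations, tau=100.0, lam=1e-4):
    """Penalize activation magnitudes exceeding tau, discouraging the network
    from forming extreme outliers that would overflow the FP8 range."""
    excess = np.maximum(np.abs(activations) - tau, 0.0)
    return lam * np.mean(excess ** 2)

# total_loss = task_loss + activation_outlier_penalty(hidden_states)
```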
4. FP8 Quantization of Optimizer States
The standard AdamW optimizer maintains first (m) and second (v) moment vectors per parameter. Prior approaches retained these states in higher precision because of FP8's limited range and resolution, but recent work demonstrates:
- The first moment m can be stored and updated in E4M3 (sufficient mantissa precision).
- The second moment v requires E5M2 (maximal dynamic range, to preserve small second-moment values).
- Bias correction for both moments is performed in FP8; the numerical stabilizer ε is represented in E5M2 (Fishman et al., 19 Sep 2024, Xi et al., 25 Oct 2024).
- Dynamic Range Expansion (DRE): exponentiation and per-group scaling map optimizer-state distributions to fully occupy the FP8 exponent space, reducing quantization-induced MSE (Xi et al., 25 Oct 2024).
With full FP8 quantization of m and v, up to 30% training memory reduction is realized, facilitating larger batch sizes and model scaling without accuracy loss.
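A minimal sketch of the moment handling is shown below; it round-trips the moments through the simulated scale/clip cast from Section 2 rather than a true FP8 cast, and omits the per-group scaling and dynamic range expansion of the cited works:

```python
import numpy as np

E4M3_MAX, E5M2_MAX = 448.0, 57344.0

def fake_fp8(t, q_max):
    """Per-tensor scale/clip round-trip standing in for an FP8 store/load."""
    scale = q_max / max(float(np.abs(t).max()), 1e-12)
    return np.clip(t * scale, -q_max, q_max) / scale

def adamw_step_fp8_moments(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step with m kept in (simulated) E4M3 and v in E5M2, with full
    bias correction; hyperparameters match the BF16 baseline."""
    m = fake_fp8(b1 * m + (1.0 - b1) * g, E4M3_MAX)      # first moment: precision-bound
    v = fake_fp8(b2 * v + (1.0 - b2) * g * g, E5M2_MAX)  # second moment: range-bound
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```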
5. Performance Metrics and Empirical Evaluation
FP8 precision delivers significant throughput and memory gains:
| Metric | BF16 | FP8 w/ stabilization | Relative Change |
|---|---|---|---|
| Throughput (samples/s, 8 Gaudi2) | 12.65 | 16.89 | +33.5% |
| Memory (GB/HPU) | 63.25 | 44.08 | –30% |
| Zero-shot accuracy, Lambada (%) | 61.98 | 61.73 | <0.3 pp |
| Divergence at >200B tokens | No | No (naive FP8: yes) | – |
FP8 achieves 34% end-to-end speedup at scale and matches BF16 for test accuracy and perplexity on representative LLM and vision workloads, provided outlier suppression is in place (Fishman et al., 19 Sep 2024, Liang et al., 28 Nov 2025). For inference, memory and bandwidth requirements are halved relative to 16-bit baselines, with operator-level matrix-multiplier utilization exceeding 90% on supporting hardware (Lee et al., 13 Mar 2025).
6. Best-Practice Guidelines and Deployment Recommendations
Key FP8 deployment guidelines derived from large-scale studies (Fishman et al., 19 Sep 2024, Peng et al., 2023, Liang et al., 28 Nov 2025):
- Use E4M3 for all forward weights and activations; assign E5M2 for backward gradients (see the configuration sketch after this list).
- Apply delayed scaling per-tensor for maximal throughput, resorting to per-channel scaling in layers sensitive to outliers (last MLP input, LayerNorm, embeddings).
- Monitor alignment between the two SwiGLU projection weight vectors (w₁/w₂), and track per-channel activation maxima for outlier detection.
- Integrate stabilization techniques (Smooth-SwiGLU, dynamic range expansion, explicit outlier penalties) into the main training loop.
- Quantize optimizer moments as m (E4M3) and v (E5M2) with full bias correction. Maintain hyperparameters identical to the BF16 runs (β₁ = 0.9, β₂ = 0.999, ε, learning rate).
- Test convergence with full FP8 on small-scale runs before scaling to production workloads.
- Validate stability over token horizons longer than prior art (≥200B tokens) to confirm late-stage accuracy.
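One concrete way to realize the E4M3-forward / E5M2-backward assignment with delayed per-tensor scaling is NVIDIA Transformer Engine's recipe API, shown here as an assumed deployment path (the cited Gaudi2 results use Intel's software stack instead, and the history length below is illustrative):

```python
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# HYBRID = E4M3 for forward weights/activations, E5M2 for backward gradients.
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,        # window of past amax values for delayed scaling
    amax_compute_algo="max",
)

# Model layers must be Transformer Engine modules (e.g., te.Linear, te.LayerNorm).
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(inp)            # hypothetical TE-based model and input
loss = loss_fn(out, target)     # hypothetical loss function and target
loss.backward()                 # backward gradients flow in E5M2 per the recipe
```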
7. Context, Limitations, and Outlook
While E4M3/E5M2 FP8 provides a practical, hardware-amenable pathway to reduced-precision model scaling, the restriction in both dynamic range and arithmetic granularity mandates continual monitoring for divergence and necessitates flexible integration of stabilization approaches (Lee et al., 29 May 2024, Fujii et al., 10 Nov 2024). Ongoing work is focused on algorithmic and architectural enhancements for even tighter error control (e.g., block quantization, simulated stochastic rounding, variance-preserving residual design) and enabling FP8 deployment across broader model classes.
Moving forward, the canonical FP8 methodology—combining E4M3/E5M2, per-channel/tensor scaling, architectural outlier suppression, and AdamW moment quantization—defines a robust template for scaling to trillion-token LLM workloads and beyond, achieving ∼34%–40% throughput improvements at scale with matched or negligibly degraded task performance relative to legacy 16-bit implementations (Fishman et al., 19 Sep 2024, Liang et al., 28 Nov 2025).
References:
- “Scaling FP8 training to trillion-token LLMs” (Fishman et al., 19 Sep 2024)
- “FP8 Formats for Deep Learning” (Micikevicius et al., 2022)
- “TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies” (Liang et al., 28 Nov 2025)
- “COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training” (Xi et al., 25 Oct 2024)
- “Mixed Precision Training With 8-bit Floating Point” (Mellempudi et al., 2019)
- “To FP8 and Back Again: Quantifying Reduced Precision Effects on LLM Training Stability” (Lee et al., 29 May 2024)
- “Efficient Post-training Quantization with FP8 Formats” (Shen et al., 2023)
- “Faster Inference of LLMs using FP8 on the Intel Gaudi” (Lee et al., 13 Mar 2025)
- “FP8-LM: Training FP8 LLMs” (Peng et al., 2023)