FP8 Mixed-Precision Training
- FP8 Mixed-Precision Training is a framework that uses 8-bit floating point operations to boost training throughput and reduce memory requirements for large Transformer models.
- TWEO introduces a loss regularizer to systematically suppress extreme activation outliers, thereby preventing numerical instability during FP8 training.
- The approach matches BF16 baseline accuracy, delivers up to a 36% training throughput gain, and enables efficient, hardware-friendly post-training quantization.
A mixed-precision training framework based on 8-bit floating point (FP8) enables deep neural network training, especially for large Transformer architectures, to leverage the hardware and energy-efficiency advantages of very low precision while maintaining numerical stability and accuracy. FP8 mixed precision is challenging because it exposes the training process to catastrophic loss explosion and gradient underflow, primarily due to extreme activation outliers that fall outside the limited representable range of the FP8 format. Recent advances such as the TWEO framework systematically address these challenges and enable robust, high-throughput, hardware-friendly end-to-end FP8 training (Liang et al., 28 Nov 2025).
1. Definition and Motivation for FP8 Mixed-Precision Training
FP8 mixed-precision training refers to neural network optimization pipelines in which all or most forward and backward tensor operations use 8-bit floating-point representations for storage and computation, while certain critical operations (e.g., accumulation, optimizer state) may use higher precision (typically BF16/FP16/FP32). The primary goals are:
- Throughput maximization: FP8 halves memory traffic and doubles peak MAC throughput relative to BF16 regimes.
- Memory footprint reduction: Model, optimizer, and activation storage requirements drop by at least 2×, which can directly translate into training larger models or using fewer GPUs.
- Hardware efficiency: Acceleration units (e.g., tensor cores) can exploit the reduced bitwidth for increased energy efficiency.
Modern hardware support for FP8 includes NVIDIA Hopper's Transformer Engine and various open accelerator initiatives. However, unmitigated FP8 training causes numerical instability due to extreme outlier values in Transformer residual streams, which exceed the representable range of FP8 (e.g., ±448 for E4M3), resulting in overflows and unstable optimization (Liang et al., 28 Nov 2025).
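To make this concrete, the following minimal sketch (plain PyTorch; the tensor size and the 20,000-magnitude outlier are illustrative choices, not values from the paper) shows how a single extreme activation, under per-tensor amax scaling, forces most of the well-behaved values below E4M3's minimum normal magnitude, where precision degrades sharply:

```python
import torch

E4M3_MAX = 448.0          # largest finite magnitude in FP8 E4M3
E4M3_MIN_NORMAL = 2**-6   # below this (after scaling), values become subnormal

def subnormal_fraction(x: torch.Tensor) -> float:
    # Per-tensor scaling maps the tensor's absolute maximum onto E4M3_MAX;
    # anything landing below the minimum normal loses mantissa precision.
    scale = E4M3_MAX / x.abs().max()
    return ((x * scale).abs() < E4M3_MIN_NORMAL).float().mean().item()

bulk = torch.randn(8192)                                    # well-behaved activations
with_outlier = torch.cat([bulk, torch.tensor([20_000.0])])  # one residual-stream outlier

print(f"bulk only:    {subnormal_fraction(bulk):.1%} of values land in the subnormal range")
print(f"with outlier: {subnormal_fraction(with_outlier):.1%} of values land in the subnormal range")
```

Because the scale is pinned to the tensor's maximum, one outlier consumes nearly the entire E4M3 dynamic range, leaving too little resolution for the bulk of the distribution.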
2. Failure Modes: Activation Outliers and Their Origin
A central impediment to robust FP8 training of Transformers is the formation of extreme outliers: activation values whose magnitude is orders of magnitude above the bulk of the distribution, often exceeding 1000× the standard deviation, particularly in the residual streams of both Vision (ViT/Swin) and Language (GPT-style) Transformer blocks.
TWEO provides the first systematic analysis of the mechanical, data-independent genesis of these outliers. For a standard Transformer block's MLP forward pass, abstracted (omitting the nonlinearity) as $y = W_{\text{down}}\,W_{\text{up}}\,x$, singular value decomposition of $W_{\text{up}}$ reveals that when a row of the down-projection aligns with a left singular vector of the up-projection, and the input aligns with the corresponding right singular vector, their product can be arbitrarily large. This mechanical amplification is inherent to the network's structure (especially in GLU-style MLPs with anti-colinear projections) and is not attributable to data or random chance. Such colinearities recur in every block and systematically produce FP8-overflowing activations (Liang et al., 28 Nov 2025).
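A toy construction illustrates this mechanism (sizes and the alignment strength are hypothetical; the nonlinearity is omitted, matching the linear abstraction above): forcing one row of the down-projection to align with a left singular vector of the up-projection makes an input along the matching right singular vector produce a far larger activation than a random input of equal norm.

```python
import torch

torch.manual_seed(0)
d_model, d_ff = 256, 1024

# Random up-/down-projections of a (linearly abstracted) MLP block.
W_up = torch.randn(d_ff, d_model) / d_model**0.5
W_down = torch.randn(d_model, d_ff) / d_ff**0.5

# SVD of the up-projection: W_up = U diag(S) V^T.
U, S, Vh = torch.linalg.svd(W_up, full_matrices=False)

# Impose the colinearity identified by the analysis: one down-projection row
# aligned with a left singular vector of W_up (the 10x factor is arbitrary).
W_down[0] = 10.0 * U[:, 0]

x_random = torch.randn(d_model)          # typical input
x_aligned = Vh[0] * x_random.norm()      # input along the matching right singular vector

for name, x in [("random input ", x_random), ("aligned input", x_aligned)]:
    y = W_down @ (W_up @ x)
    print(f"{name}: max |activation| = {y.abs().max():.1f}")
```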
3. TWEO: Blockwise Outlier Suppression via Loss Regularization
The TWEO (Transformers Without Extreme Outliers) framework introduces a universal, non-invasive regularizer targeting the tails of activation distributions. The training objective augments the task loss with a blockwise penalty:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{TWEO}}, \qquad \mathcal{L}_{\text{TWEO}} = \frac{1}{L}\sum_{\ell=1}^{L}\mathbb{E}_{i}\!\left[\left(\frac{h_{\ell,i}^{2}}{\tau^{2}+\epsilon}\right)^{p/2}\right],$$

where:
- $h_\ell$ is the output activation of Transformer block $\ell$ (indexed elementwise by $i$; $L$ is the number of blocks),
- $\tau$ is a soft threshold (default 3.0),
- $p$ is the penalty power (default 4: negligible penalty below the threshold, superlinear above),
- $\epsilon$ ensures numerical stability,
- $\lambda$ is a small weight (e.g., 0.01, optionally cosine-annealed).
The penalty activates primarily for rare, large-magnitude activations, letting "normal" units pass, thus constraining the outliers that would otherwise produce catastrophic FP8 overflows. It is applied blockwise and averaged over all activations at each training iteration (Liang et al., 28 Nov 2025).
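A minimal PyTorch sketch of this penalty, following the formulation above (the function and argument names are illustrative, not taken from the paper's code):

```python
import torch

def tweo_penalty(block_outputs: list[torch.Tensor],
                 tau: float = 3.0, p: int = 4, eps: float = 1e-6) -> torch.Tensor:
    """Blockwise outlier penalty averaged over all block outputs.

    Activations well below the soft threshold tau contribute negligibly
    ((|h| / tau)^p << 1), while rare large-magnitude activations are
    penalized superlinearly with power p.
    """
    per_block = []
    for h in block_outputs:
        # Normalize squared activations by the soft threshold; eps keeps the
        # denominator (and gradients) numerically stable.
        ratio = h.float().pow(2) / (tau ** 2 + eps)
        per_block.append(ratio.pow(p / 2).mean())
    return torch.stack(per_block).mean()
```

In practice the block outputs can be collected, e.g., with forward hooks on each residual block, and the penalty is added to the task loss with weight $\lambda$.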
4. End-to-End FP8 Training Pipeline Enabled by TWEO
With TWEO outlier suppression, full end-to-end FP8 pretraining becomes feasible without auxiliary engineering such as:
- Selective fallback of sensitive modules (e.g., softmax, embeddings) to BF16,
- Use of "register tokens," atypical normalization, or custom tile-wise scaling,
- Special architectural surgery (e.g., modifying residual branches, gating structures).
Instead, every Linear and LayerNorm module, including token embeddings and output heads, operates under per-tensor FP8 autocasting throughout. The training loop uses hybrid E4M3/E5M2 formats (with amax_history_len=16 in the hardware Transformer Engine), as sketched below. No module requires BF16 fallback. This approach is stable for both vision (ViT/Swin series) and language (GPT-2, GPT-3 class) models up to at least 7B parameters (Liang et al., 28 Nov 2025).
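A minimal sketch of such a setup using NVIDIA Transformer Engine's public PyTorch API (layer sizes are illustrative; an FP8-capable GPU is required at runtime, and the paper's exact module wiring is not reproduced here):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid FP8 recipe: E4M3 for forward tensors, E5M2 for gradients,
# with per-tensor delayed scaling over a 16-step amax history.
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

# Every Linear/LayerNorm runs under FP8 autocast; nothing falls back to BF16.
model = torch.nn.Sequential(
    te.LayerNorm(1024),
    te.Linear(1024, 4096),
    te.Linear(4096, 1024),
).cuda()

x = torch.randn(8, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)
```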
Empirical metrics:
| Model (metric) | FP8+TWEO | BF16 | Gap |
|---|---|---|---|
| GPT-2 124M (PPL) | 19.26 | 18.68 | +0.58 |
| GPT-2 XL 1.6B (PPL) | 12.58 | 12.39 | +0.19 |
| Swin-T (top-1, %) | 81.4 | 81.2 | +0.2 |
| Throughput (A100) | +36% | — | — |
Standard FP8 training diverges within a few thousand steps (activation peaks >10,000). With TWEO, the training loss tracks the BF16 baseline to within 0.1% (relative) (Liang et al., 28 Nov 2025).
5. Impact on Quantization: Hardware-Friendly W8A8 Post-Training
TWEO-trained models exhibit strong quantization resilience. After training, activation distributions are tightly concentrated (peak magnitudes below 20), allowing direct, naively symmetric, per-tensor, static post-training quantization in 8 bits for both weights and activations (W8A8 with per-tensor scaling). The quantizer is defined as

$$Q(x) = \mathrm{clip}\!\left(\Big\lfloor \frac{x}{s} \Big\rceil,\,-127,\,127\right), \qquad s = \frac{\max_i |x_i|}{127},$$

with a single static scale $s$ per tensor.
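A minimal sketch of this quantizer (names are illustrative; for activations the scale would be calibrated statically, e.g. on a held-out batch, rather than recomputed per input):

```python
import torch

def quantize_per_tensor_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Symmetric, static, per-tensor INT8 quantization with a single scale s.
    return torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.to(torch.float32) * scale

# Static calibration: one scale per tensor (s = max|x| / 127).
w = torch.randn(1024, 1024) * 0.02           # illustrative weight tensor
s = w.abs().max().item() / 127.0
w_q = quantize_per_tensor_int8(w, s)
print((dequantize(w_q, s) - w).abs().max())  # worst-case roundtrip error ~ s/2
```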
Key downstream properties:
- GPT-2 Medium: W8A8 per-tensor PPL degrades by only 0.2–0.5%, versus total collapse for non-TWEO models (PPL >1400).
- Swin-T: per-tensor W8A8 accuracy is 80.94% vs. 81.4% unquantized, while baseline models collapse to ~77% or worse.
This finding represents a paradigm shift: naive per-tensor quantization, widely deemed infeasible for LLMs due to outliers, becomes practical and state-of-the-art when models are trained under TWEO (Liang et al., 28 Nov 2025).
6. Implementation, Hyperparameters, and Best Practices
TWEO is implemented by augmenting the existing loss with the block-output regularizer described above; its hyperparameters ($\tau$, $p$, $\epsilon$, $\lambda$) are robust across model scales (28M–7B), modalities (language, vision), and optimizer settings. The framework is compatible with hardware-native FP8 support such as NVIDIA's Transformer Engine. Key recommendations (a minimal integration sketch follows the list):
- FP8 autocast is globally enabled (no selective fallback to BF16 anywhere).
- Per-tensor scaling is used in hardware (amax_history_len=16).
- TWEO is applied at every residual block output, including MLP and attention branches.
- The approach is architecture-agnostic and does not require any modifications to Transformer internals.
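A compact sketch of how the loss is assembled in a training step, with the Section 3 penalty inlined for self-containment (the cosine schedule's endpoints are an assumption; the paper only states that $\lambda$ may optionally be cosine-annealed):

```python
import math
import torch

def cosine_annealed_lambda(step: int, total_steps: int, lam_max: float = 0.01) -> float:
    # Cosine-annealed regularizer weight: starts at lam_max and decays to 0.
    return 0.5 * lam_max * (1.0 + math.cos(math.pi * min(step, total_steps) / total_steps))

def training_loss(task_loss: torch.Tensor,
                  block_outputs: list[torch.Tensor],
                  step: int, total_steps: int,
                  tau: float = 3.0, p: int = 4, eps: float = 1e-6) -> torch.Tensor:
    # Blockwise TWEO penalty (see the Section 3 sketch), applied at every
    # residual block output and weighted by the annealed lambda.
    penalty = torch.stack([
        (h.float().pow(2) / (tau ** 2 + eps)).pow(p / 2).mean()
        for h in block_outputs
    ]).mean()
    return task_loss + cosine_annealed_lambda(step, total_steps) * penalty
```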
This methodology transforms previously collapse-prone and finicky FP8 training regimens into robust, straightforward, and hardware-optimal pipelines, with significant throughput gains and memory savings (Liang et al., 28 Nov 2025).
References
- "TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies" (Liang et al., 28 Nov 2025)