FP8 Training on Trillion-Token Datasets

Updated 15 December 2025
  • The paper introduces FP8 training methods that combine precision format engineering, dynamic loss scaling, and regularization to maintain stability and throughput on trillion-token datasets.
  • FP8 training uses E4M3 and E5M2 formats with per-tensor scaling and quantization strategies to mitigate precision loss and manage activation outliers.
  • Practical implementations such as μnit Scaling and FP8-Flow-MoE offer casting-free dataflows and substantial memory reductions, ensuring robustness and efficiency at scale.

Training LLMs using 8-bit floating point (FP8) formats on trillion-token datasets has enabled major advances in computational efficiency, memory savings, and throughput. FP8 training is now feasible at O(10¹²)-token scale, but exposes novel challenges in numerical stability due to the limited dynamic range and precision of FP8 arithmetic. Recent methods combine precision format engineering, architectural variants, activation regularization, and optimizer quantization to address loss divergence, activation outliers, and quantization error.

1. FP8 Numeric Formats and Quantization Mechanisms

FP8 is defined by two primary interchange formats: E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa), both with a single sign bit. E4M3 offers max normal = 448 and min normal ≈ 0.016; E5M2 extends to max normal = 57,344 and min normal ≈ 6.1×10⁻⁵. E5M2 closely follows IEEE-754 conventions for special values (zero, infinity, NaN), while E4M3 reclaims the infinity bit patterns for extended dynamic range (Micikevicius et al., 2022). Typical FP8 quantization computes a scale $S$ for each tensor (at per-tensor, per-block, or per-group granularity), rounds $x/S$ to the nearest representable FP8 value, and saturates out-of-range values. Dequantization simply returns $x_{dq} = S\,x_q$.
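
A minimal sketch of this per-tensor quantize/dequantize step, assuming PyTorch ≥ 2.1 and its `float8_e4m3fn` dtype; the function names and the epsilon guard are illustrative, not any particular library's API.

```python
import torch

# E4M3 max normal magnitude quoted above; E5M2 would use 57344.0 instead.
E4M3_MAX = 448.0

def fp8_quantize_per_tensor(x: torch.Tensor, fp8_max: float = E4M3_MAX):
    """Per-tensor FP8 quantization following the recipe above: compute a
    scale S from the tensor's amax, round x/S to the nearest representable
    FP8 value, and saturate anything out of range."""
    amax = x.abs().max().clamp(min=1e-12)            # largest magnitude in the tensor
    scale = amax / fp8_max                           # divisor S mapping amax onto fp8_max
    x_scaled = (x / scale).clamp(-fp8_max, fp8_max)  # saturate out-of-range values
    x_q = x_scaled.to(torch.float8_e4m3fn)           # round to the nearest E4M3 value
    return x_q, scale

def fp8_dequantize(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantization simply returns x_dq = S * x_q."""
    return x_q.to(torch.float32) * scale
```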

FP8 training is supported by hardware such as NVIDIA H100, whose TensorCores accept both formats; a common "hybrid" recipe uses E4M3 for forward-pass tensors and E5M2 for gradients. Quantization error is mitigated by setting scale factors from maximum absolute tensor values (amax tracking), updating the scale from a history buffer (often length 16–32), and optionally adding a margin to widen the representable range (Fujii et al., 10 Nov 2024). Dynamic loss scaling, analogous to FP16 AMP, prevents gradient overflow and underflow during the backward pass.
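
A hedged sketch of how the amax history, margin, and update interval described above combine into a scale; this is an illustration rather than any library's implementation, and the class and method names are assumptions.

```python
from collections import deque
import math

class DelayedFP8Scale:
    """Amax-history ("delayed") scale scheduling as described above.

    Keeps the most recent amax observations, takes their maximum, and derives
    a power-of-two divisor S with `margin` extra bits of headroom, refreshed
    every `interval` steps. Defaults mirror the ranges quoted in the text
    (history 16-32, margin 1-2, interval >= 4).
    """

    def __init__(self, fp8_max: float = 448.0, history_len: int = 16,
                 margin: int = 1, interval: int = 4):
        self.fp8_max = fp8_max
        self.history = deque(maxlen=history_len)  # rolling amax history buffer
        self.margin = margin
        self.interval = interval
        self.step = 0
        self.scale = 1.0                          # divisor S; x_q = round(x / S)

    def observe(self, amax: float) -> float:
        """Record the current tensor amax and periodically refresh S."""
        self.history.append(max(amax, 1e-12))
        self.step += 1
        if self.step % self.interval == 0:
            worst_amax = max(self.history)
            # Smallest power-of-two S with worst_amax / S <= fp8_max,
            # widened by `margin` extra powers of two.
            exp = math.ceil(math.log2(worst_amax / self.fp8_max)) + self.margin
            self.scale = 2.0 ** exp
        return self.scale
```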

2. Numerical Stability, Outlier Suppression, and Loss Divergence

Standard FP8 training is fragile at scale due to activation outliers, which originate from architectural and optimization artifacts rather than from data properties alone (Liang et al., 28 Nov 2025).

Mitigation strategies include:

  • TWEO (Transformers Without Extreme Outliers): a loss regularizer penalizing heavy tails in post-MLP residual activations via $\mathcal{L}_{\text{TWEO}} = \frac{1}{L} \sum_{l=1}^{L} \mathbb{E}_{b,s,h}\!\left[\left(\frac{|A^{(l)}_{b,s,h}|}{\tau+\epsilon}\right)^{p}\right]$, where $p$ is typically 4 and $\tau \approx 3$; this reduces outliers from >10,000 to <20 per layer (Liang et al., 28 Nov 2025). A minimal sketch of this penalty follows the list.
  • Smooth-SwiGLU: the accompanying analysis shows that SwiGLU weight alignment drives outlier spikes beyond roughly 200B tokens; Smooth-SwiGLU applies per-channel scaling and post-scaling to cap extreme values while remaining functionally identical to standard SwiGLU (Fishman et al., 19 Sep 2024).
  • μnit Scaling: restricts tensor variance throughout the stack via square-root softmax, fixed residual weighting, and universal 1/√fan_in scaling so that all tensors remain O(1) and fit the FP8 range, removing the need for dynamic scaling (Narayan et al., 9 Feb 2025).
  • FOG block design: removes pre-normalization, freezes RMSNorm gain, post-normalizes every residual branch, and replaces “outlier-amplifying” activations, achieving sublinear kurtosis growth over trillion-token regimes (Hernández-Cano et al., 26 May 2025).
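
As referenced in the TWEO bullet above, here is a minimal sketch of that penalty, written directly from the formula as reproduced in the text; the tensor layout (batch, sequence, hidden per layer) and the loss weight in the usage comment are assumptions about how it would be wired into a training loop.

```python
import torch

def tweo_penalty(residual_acts, tau: float = 3.0, p: int = 4, eps: float = 1e-6):
    """TWEO-style outlier penalty on post-MLP residual activations.

    residual_acts: list of L tensors, each of shape (batch, seq, hidden),
    one per transformer layer. Implements
        L_TWEO = (1/L) * sum_l E_{b,s,h}[ (|A^(l)_{b,s,h}| / (tau + eps))^p ]
    as reproduced in the text; p=4 and tau ~ 3 are the typical values quoted.
    """
    terms = []
    for acts in residual_acts:
        terms.append(((acts.abs() / (tau + eps)) ** p).mean())
    return torch.stack(terms).mean()

# Usage sketch: add the penalty to the language-model loss with a small weight
# (the weight value here is an assumption, not taken from the cited papers).
# loss = lm_loss + 1e-4 * tweo_penalty(collected_residual_activations)
```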

3. End-to-End FP8 Training Pipelines and Dataflows

FP8 training pipelines now keep the full network dataflow in FP8 or make it nearly casting-free:

  • μnit Scaling: a lightweight patch (10–20 lines) to a standard Transformer that replaces all matmuls/GEMMs with FP8 calls and relies on a static scale, post-LN, and explicit residual weighting; no per-tensor scale tracking or optimizer tuning is required (Narayan et al., 9 Feb 2025).
  • FP8-Flow-MoE: realizes casting-free, quantization-consistent FP8 MoE blocks via a scaling-aware tile-wise transpose. Quantization error from double-casting is eliminated by using tile-wise power-of-two scaling and direct FP8 exponent updates rather than repeated quantize/dequantize pairs; only the entry and exit points remain in BF16 (Wang et al., 4 Nov 2025).
  • COAT: compresses activations and optimizer states for memory efficiency. A per-group dynamic-range expansion (compander) aligns moment distributions to the FP8 range, and mixed-granularity quantization applies per-tensor scales to linear ops and per-group scales to nonlinearity inputs, reducing memory by up to 1.65× for activations and 2× for optimizer states over BF16 with a 1.43× throughput speedup (Xi et al., 25 Oct 2024). A per-group quantization sketch follows this list.
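
To make the mixed-granularity idea in the COAT bullet concrete, the sketch below quantizes a tensor per group of 128 elements; the group size, function names, and reshape convention are illustrative assumptions rather than COAT's actual kernels.

```python
import torch

def fp8_quantize_per_group(x: torch.Tensor, group_size: int = 128,
                           fp8_max: float = 448.0):
    """Per-group FP8 quantization along the flattened last dimension.

    Each contiguous group of `group_size` elements gets its own scale, so a
    single outlier only degrades precision within its group rather than
    across the whole tensor. Assumes x.numel() is divisible by group_size.
    """
    orig_shape = x.shape
    x = x.reshape(-1, group_size)
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max                       # per-group divisor S
    x_q = (x / scale).to(torch.float8_e4m3fn)    # round each group to FP8
    return x_q.reshape(orig_shape), scale

def fp8_dequantize_per_group(x_q: torch.Tensor, scale: torch.Tensor,
                             group_size: int = 128) -> torch.Tensor:
    """Inverse mapping, applied group-wise: x_dq = S * x_q."""
    x = x_q.to(torch.float32).reshape(-1, group_size) * scale
    return x.reshape(x_q.shape)
```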

4. Training Setup, Hyperparameters, and Scaling to Trillion Token Regimes

Recent experiments demonstrate end-to-end stability and performance at trillion-token scale and beyond:

  • 405B parameter models have been trained on 15.6T tokens (Llama 3 Herd) using FP8 with hardware support and dynamic scale scheduling (Fujii et al., 10 Nov 2024).
  • Llama2-7B was trained on 2T tokens with FP8 activations, weights, and optimizer states, achieving a 33% speedup and matching BF16 accuracy via Smooth-SwiGLU and Adam moment quantization (E4M3 for $m_t$, E5M2 for $v_t$) (Fishman et al., 19 Sep 2024).
  • DeepSeek-V3 MoE (671B params) uses FP8-Flow-MoE to reduce memory by 16.5 GB/GPU, increase throughput by 21%, and match BF16 convergence over 200B tokens (Wang et al., 4 Nov 2025).

Typical large-scale settings (collected into a config sketch after this list):

  • Optimizer: AdamW or Adam, β₁=0.9, β₂=0.95, ε=1e-8 or 1e-6, weight decay ≈0.1 (Fishman et al., 19 Sep 2024, Fujii et al., 10 Nov 2024).
  • LR schedule: cosine decay, linear warmup 1K–10K steps.
  • Batch size: 1024–128K tokens/step globally; context length 2K–4K.
  • Scale scheduling: amax_history_len 16–32, fp8_margin 1–2, fp8_interval ≥4 for robust scale updates (Fujii et al., 10 Nov 2024).
  • Quantization: per-tensor for GEMM-heavy ops, per-group for nonlinear activations and optimizer states (Xi et al., 25 Oct 2024).
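
Collected for reference, a small dataclass holding the settings listed above; the field names and the specific defaults chosen within the quoted ranges are illustrative, not recommendations from the cited papers.

```python
from dataclasses import dataclass

@dataclass
class FP8TrainConfig:
    """Representative large-scale FP8 training settings from the list above."""
    # Optimizer (AdamW/Adam)
    beta1: float = 0.9
    beta2: float = 0.95
    adam_eps: float = 1e-8        # 1e-6 is also used
    weight_decay: float = 0.1
    # Schedule
    lr_schedule: str = "cosine"
    warmup_steps: int = 2000      # typically 1K-10K
    # Data
    context_length: int = 4096    # 2K-4K in the cited setups
    # FP8 scale scheduling
    amax_history_len: int = 16    # 16-32
    fp8_margin: int = 1           # 1-2
    fp8_interval: int = 4         # >= 4
```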

5. Empirical Speed, Memory, and Quality Outcomes

FP8 training consistently yields 25–40% throughput improvement over BF16 and significant memory savings:

| Method | Throughput gain (vs BF16) | Memory reduction | Stability/Accuracy |
|---|---|---|---|
| μnit Scaling (Narayan et al., 9 Feb 2025) | 25–33% | – | Matches BF16; no tuning |
| FOG (Hernández-Cano et al., 26 May 2025) | 18–40% | – | Matches BF16 to 420B tokens |
| FP8-Flow-MoE (Wang et al., 4 Nov 2025) | 21% | up to 16.5 GB/GPU | Stable to 200B tokens (MoE) |
| COAT (Xi et al., 25 Oct 2024) | 1.43× | 1.54–1.65× | <0.5% accuracy loss |
| TWEO (Liang et al., 28 Nov 2025) | 36% | – | Outliers <20; matches BF16 |
| Smooth-SwiGLU (Fishman et al., 19 Sep 2024) | 33% | 30% (optimizer) | No divergence to 2T tokens |

Performance parity is maintained across language (LLMs, GPT, Llama) and vision (ViT, Swin-T/B) tasks. FP8 loss curves track BF16 up to O(10¹²) tokens; downstream metrics (e.g., Wikitext, HellaSwag, ARC) differ by ≤0.5% in zero-shot settings.

6. Practical Implementation Guidelines and Reliability

Practitioners can scale FP8 training reliably by following the format choices, scale-scheduling settings, and architectural safeguards outlined above.

Robust FP8 training at trillion-token scale is now achievable without extensive hyperparameter tuning, with empirical evidence of convergence, throughput, and downstream quality. Key advances include activation regularization, casting-free dataflows, memory-efficient compression, and principled variance control; together these support stable, near-lossless, and efficient LLM training even in resource-constrained environments.
