End-to-End FP8 Training Pipeline
- End-to-End FP8 Training Pipelines are comprehensive frameworks that enable neural network training entirely in 8-bit precision, optimizing forward, backward, and optimizer stages.
- They integrate novel quantization schemes, fused FP8 GEMM operations, and specialized stability methods to achieve near lossless accuracy relative to BF16 baselines.
- Practical applications include large-scale language models, vision transformers, federated learning, and reinforcement learning, showcasing broad impact in deep learning.
End-to-end FP8 training pipelines are comprehensive frameworks that enable neural network training, including forward, backward, and optimizer stages, to be carried out primarily or entirely in 8-bit floating point (FP8) precision. Recent advances have overcome critical limitations in numerical stability, hardware kernel coverage, and activation outlier management. End-to-end FP8 approaches now support large-scale LLMs, vision transformers (ViTs), federated learning, and reinforcement learning. State-of-the-art FP8 training pipelines combine architectural innovations, fine-grained quantization, optimizer quantization, and targeted stability solutions, yielding substantial throughput and memory improvements at near lossless accuracy with respect to BF16 baselines.
1. FP8 Numerics, Formats, and Quantization Schemes
The two principal FP8 formats in deep learning are E4M3 (1 sign, 4 exponent, 3 mantissa bits; exponent bias 7, range ±448) and E5M2 (1 sign, 5 exponent, 2 mantissa bits; bias 15, range ±57,344). Format selection is application-driven: E4M3 is preferred for weights and activations because of its higher mantissa precision; E5M2 for gradients and optimizer states, which require a broader dynamic range (Xi et al., 2024, Fishman et al., 2024, Liang et al., 28 Nov 2025).
Quantization is accomplished by computing a scaling factor \( s = \max_i |x_i| / q_{\max} \), where \( q_{\max} \) is the largest representable FP8 value. The quantizer is
\[
Q(x) = \mathrm{cast}_{\mathrm{FP8}}\!\left(\frac{x}{s}\right), \qquad x \approx s \cdot Q(x),
\]
applied at per-tensor, per-block, or per-group granularity. Quantized values are stored in FP8, with dequantization and rescaling performed on-the-fly as needed.
Delayed scaling (maintaining a windowed history for amax updates) and power-of-two scale representation (UE8M0) are often employed to ensure robust dynamic range and overflow avoidance (Wang et al., 26 Sep 2025, Liang et al., 28 Nov 2025).
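The scheme above can be illustrated with a minimal per-tensor quantizer. The sketch below assumes PyTorch 2.1+ with native float8 dtypes (`torch.float8_e4m3fn`); the function names are illustrative and are not taken from any of the cited libraries.

```python
# Minimal sketch of per-tensor FP8 quantization/dequantization.
# Assumes PyTorch >= 2.1 with torch.float8_e4m3fn support; names are illustrative.
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_fp8(x: torch.Tensor, eps: float = 1e-12):
    """Quantize to FP8 E4M3 with a single per-tensor scale s = amax / q_max."""
    amax = x.abs().max().clamp(min=eps)
    scale = amax / FP8_E4M3_MAX
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor: x ~= s * Q(x)."""
    return x_fp8.to(torch.float32) * scale

# Usage: round-trip error is small for well-scaled tensors.
x = torch.randn(4, 1024)
x_q, s = quantize_fp8(x)
print((x - dequantize_fp8(x_q, s)).abs().max())
```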
2. Full-coverage FP8 GEMM and Dataflow Integration
Efficient end-to-end FP8 training requires that all major matrix multiplications (GEMMs) in the model (MLPs, attention projections, MoE experts, and embedding lookups) execute directly in FP8 precision during both forward and backward passes. Leading pipelines such as that of Hernández-Cano et al. (26 May 2025) and InfiR2 (Wang et al., 26 Sep 2025) implement blockwise (typically 128×128) or per-tensor weight quantization and per-token (row-wise) activation quantization, using DeepGEMM or equivalent kernels for FP8×FP8→FP16 accumulation.
Non-GEMM operations—LayerNorm, softmax, bias-add, etc.—run in FP32, with type casting confining quantization to linear operations. Intermediate activations are stored quantized in FP8 to minimize memory overhead and accelerate the backward pass. In typical training iterations, weights and optimizer states reside in FP32 master copies and are re-quantized at each step (Hernández-Cano et al., 26 May 2025, Fishman et al., 2024, Wang et al., 26 Sep 2025).
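As a rough illustration of this hybrid granularity, the sketch below quantizes weights blockwise (a 128×128 block size is assumed for illustration) and activations per token, then simulates the FP8 GEMM by dequantizing and calling a standard matmul ("fake quantization"); real pipelines dispatch to DeepGEMM or Transformer Engine kernels instead.

```python
# Illustrative sketch: blockwise weight quantization + per-token activation
# quantization, with the FP8 GEMM simulated by dequantize + regular matmul.
import torch

Q_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_blockwise(w: torch.Tensor, block: int = 128):
    """Per-block scales for a 2-D weight matrix (dims must divide the block size)."""
    out, inn = w.shape
    wb = w.reshape(out // block, block, inn // block, block)
    scales = wb.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12) / Q_MAX
    return (wb / scales).to(torch.float8_e4m3fn), scales, w.shape

def quantize_per_token(x: torch.Tensor):
    """One scale per row (token) of the activation matrix."""
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / Q_MAX
    return (x / scales).to(torch.float8_e4m3fn), scales

def fp8_linear_sim(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 forward pass: quantize, dequantize, matmul in high precision."""
    x_q, sx = quantize_per_token(x)
    w_q, sw, shape = quantize_blockwise(w)
    w_deq = (w_q.to(torch.float32) * sw).reshape(shape)
    x_deq = x_q.to(torch.float32) * sx
    return x_deq @ w_deq.t()

x = torch.randn(16, 1024)    # 16 tokens, hidden size 1024
w = torch.randn(4096, 1024)  # output dimension 4096
y = fp8_linear_sim(x, w)     # (16, 4096)
```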
3. Stability, Activation Outliers, and Architectural Solutions
Naive FP8 pipelines are highly sensitive to activation outliers, which induce overflow and catastrophic instability, especially with SwiGLU/GLU activations or over prolonged training. Two major approaches address this:
- Architectural modifications: Fishman et al. introduce Smooth-SwiGLU, applying per-channel rescaling to contain the quadratic outlier amplification observed in standard SwiGLU; the rescaled output is quantized so activations stay within FP8’s representational limits (Fishman et al., 2024). A schematic of this rescaling appears after this list.
- Loss-based regularization: TWEO (Liang et al., 28 Nov 2025) introduces a non-invasive outlier-penalty loss term that indirectly regularizes weight colinearity and suppresses activation outliers without architectural changes, reducing outlier magnitudes to below roughly 20 and enabling all modules to remain in FP8 precision.
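The following is a conceptual sketch of the per-channel rescaling idea behind Smooth-SwiGLU, not the implementation of Fishman et al. (2024): the gated activation is divided by per-channel scales before FP8 quantization, and the scales are folded into the down-projection so the layer output is, up to quantization error, unchanged.

```python
# Schematic of Smooth-SwiGLU-style per-channel rescaling (conceptual sketch only).
import torch
import torch.nn.functional as F

Q_MAX = torch.finfo(torch.float8_e4m3fn).max

def smooth_swiglu(x, w_gate, w_up, w_down):
    # Standard SwiGLU hidden activation: silu(x W_gate^T) * (x W_up^T).
    h = F.silu(x @ w_gate.t()) * (x @ w_up.t())

    # Per-channel scales chosen so each channel's amax maps to the FP8 maximum.
    s = h.abs().amax(dim=0).clamp(min=1e-12) / Q_MAX          # shape (d_ff,)

    # Quantize the rescaled activation; it now stays within FP8's range.
    h_fp8 = (h / s).to(torch.float8_e4m3fn)

    # Fold the scales into the down-projection so the output is unchanged up to
    # quantization error: (h / s) @ (W_down * s)^T == h @ W_down^T.
    return h_fp8.to(torch.float32) @ (w_down * s).t()

x = torch.randn(8, 512)
w_gate, w_up = torch.randn(2048, 512), torch.randn(2048, 512)
w_down = torch.randn(512, 2048)
y = smooth_swiglu(x, w_gate, w_up, w_down)   # (8, 512)
```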
Both mechanisms ensure stable convergence of full-model FP8 training, matching BF16 accuracy even for large LLMs and ViTs. Without such solutions, fallback to BF16/FP32 for sensitive components negates FP8’s efficiency (as in prior attempts (Hernández-Cano et al., 26 May 2025)).
4. Optimizer and Gradient Quantization Strategies
Optimizer state quantization is critical for memory and bandwidth scaling. The most robust method, as demonstrated in COAT (Xi et al., 2024) and Fishman et al. (2024), stores both the first-order (momentum) and second-order (variance) Adam moments in FP8 (E4M3 and E5M2, respectively). Prior to each update, the FP8 states are dequantized to FP32, the AdamW update is computed, and the new moments are quantized back into FP8. COAT further proposes dynamic range expansion, using a power-law transform to maximize use of the FP8 code range and substantially reduce quantization error (Xi et al., 2024).
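A simplified version of this dequantize, update, requantize loop is sketched below; per-tensor scales are used and COAT's dynamic range expansion and bias correction are omitted, so this is illustrative rather than the published recipe.

```python
# Sketch of the FP8 optimizer-state loop: dequantize moments -> AdamW update in
# FP32 -> requantize (momentum in E4M3, variance in E5M2). Illustrative only.
import torch

def fp8_state_quant(t, dtype):
    q_max = torch.finfo(dtype).max
    scale = t.abs().max().clamp(min=1e-12) / q_max
    return (t / scale).to(dtype), scale

def adamw_step_fp8(p, grad, m_fp8, m_s, v_fp8, v_s, lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-8, wd=0.01):
    # Dequantize FP8 moments to FP32 for the update.
    m = m_fp8.to(torch.float32) * m_s
    v = v_fp8.to(torch.float32) * v_s

    # Standard AdamW update in FP32 (bias correction omitted for brevity).
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad.pow(2)
    p -= lr * (m / (v.sqrt() + eps) + wd * p)

    # Requantize the updated moments back to FP8 storage.
    m_fp8, m_s = fp8_state_quant(m, torch.float8_e4m3fn)
    v_fp8, v_s = fp8_state_quant(v, torch.float8_e5m2)
    return p, m_fp8, m_s, v_fp8, v_s
```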
Backward flow employs mixed-precision GEMMs but accumulates gradients in FP32 prior to the optimizer step. Dynamic or static loss scaling and gradient-norm clipping in FP32 complete the stabilization toolkit (Wang et al., 26 Sep 2025, Xi et al., 2024).
5. Advanced Architectures: MoE, RL, and Federated Learning
End-to-end FP8 training has expanded beyond standard LLMs to:
- MoE Models: FP8-Flow-MoE (Wang et al., 4 Nov 2025) uses scaling-consistent dataflows and scaling-aware transposes to eliminate double quantization error across expert and gate tensors, reducing explicit cast operations and maintaining high throughput and convergence on 671B-parameter MoE models.
- Reinforcement Learning LLMs: FP8-RL (Qiu et al., 26 Jan 2026) and Jet-RL (Xi et al., 20 Jan 2026) enforce a unified FP8 precision flow for both training and rollout, eliminating policy mismatches that otherwise destabilize long-horizon generation. Granular (block or group-level) quantization of weights, activations, gradients, and the KV-cache is coordinated between learner and actor engines, with precise scale recalibration and importance sampling for loss correction.
- Federated Learning: Decentralized FP8 training (Wang et al., 2024) quantizes client weights and activations with per-tensor scales, maintains master copies locally, and uses unbiased stochastic quantization for client-server communication (sketched below), achieving substantial communication savings with FedAvg-like convergence and <1% accuracy loss.
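The unbiased stochastic quantization used for communication can be sketched as stochastic rounding onto a scaled low-precision grid, so that the dequantized value equals the original in expectation. The uniform 8-bit grid below is a simplified stand-in for the FP8 codebook and is not the exact scheme of Wang et al. (2024).

```python
# Sketch of unbiased stochastic quantization for client-server communication.
# A uniform 256-level grid stands in for the FP8 codebook; stochastic rounding
# makes E[dequantize(Q(x))] = x, so averaged client updates stay unbiased.
import torch

def stochastic_quantize_8bit(x: torch.Tensor, levels: int = 256):
    q_max = levels // 2 - 1                                   # 127
    scale = x.abs().max().clamp(min=1e-12) / q_max
    y = (x / scale).clamp(-q_max, q_max)
    low = y.floor()
    codes = low + (torch.rand_like(y) < (y - low)).float()    # stochastic rounding
    return codes.to(torch.int8), scale                        # 8-bit payload + FP32 scale

def dequantize_8bit(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.to(torch.float32) * scale
```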
6. Empirical Performance and Hardware Integration
End-to-end FP8 pipelines exhibit substantial training acceleration and memory reduction in both single-node and distributed settings:
| Reference | Throughput Gain vs. BF16 | Memory Reduction | Tasks/Models |
|---|---|---|---|
| InfiR2 (Wang et al., 26 Sep 2025) | +19% | 14% | LLM pretraining (160B tokens), SFT, AIME25/24, GPQA |
| COAT (Xi et al., 2024) | +43% | 1.54× | Llama-2-13B, VLMs, FSDP/ZeRO distributed |
| TWEO (Liang et al., 28 Nov 2025) | +36% | n/a | GPT-2 (124M–1.6B), Swin-T/S, ViT-B |
| Fishman et al. (2024) | +33% | 30% | LLaMA2-7B (2T tokens), on 256 Gaudi2 HPUs |
| FP8-Flow-MoE (Wang et al., 4 Nov 2025) | +21% | 16.5 GB | 671B-param MoE, DeepSeek-V2-Lite 16B |
| Jet-RL (Xi et al., 20 Jan 2026) | +16–33% | n/a | LLM RL: GSM8K, MATH500, DeepMATH, long-rollout |
Integration makes use of NVIDIA Transformer Engine, DeepGEMM kernels, Megatron-LM, and in the case of Fishman et al., Intel Gaudi2 native FP8 support (Fishman et al., 2024, Liang et al., 28 Nov 2025, Wang et al., 26 Sep 2025). Delayed scaling buffers and fused quantization-dequantization primitives enable seamless hardware utilization.
7. Practical Guidelines, Monitoring, and Best Practices
Best practices established by recent work include:
- Quantization granularity: Hybrid schemes (blockwise for weights, per-token/group for activations) achieve near-BF16 accuracy with optimal efficiency (Wang et al., 26 Sep 2025, Xi et al., 2024, Xi et al., 20 Jan 2026).
- Stability monitoring: Track key metrics such as peak activation magnitude and 'amplification index' for early divergence prediction (Hernández-Cano et al., 26 May 2025).
- Loss scaling and clipping: Use static or dynamic loss scaling and per-layer gradient-norm clipping to stabilize training.
- Outlier handling: Apply TWEO or Smooth-SwiGLU for robust outlier suppression as soon as the activation distribution drifts (Liang et al., 28 Nov 2025, Fishman et al., 2024).
- Optimizer states: Always dequantize FP8 moments before use and re-quantize them after the AdamW update; apply dynamic range expansion when groupwise quantization is employed (Xi et al., 2024).
- Fused kernels and delayed scaling: Use fused GEMM+quant kernels, all-in-FP8 computation, and delayed-scaling amax updates to maximize throughput (Wang et al., 4 Nov 2025, Wang et al., 26 Sep 2025); see the delayed-scaling sketch after this list.
- Software integration: Tools such as COAT, FP8-Flow-MoE, and InfiR2 are implemented as wrappers or drop-in modules for PyTorch and TransformerEngine (Xi et al., 2024, Wang et al., 4 Nov 2025, Wang et al., 26 Sep 2025).
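As an illustration of the delayed-scaling recipe referenced above, the sketch below derives each step's scale from a window of previously observed amax values rather than the current tensor, mirroring the recipe popularized by NVIDIA Transformer Engine; the class and parameter names are illustrative, not Transformer Engine's API.

```python
# Sketch of delayed scaling: the quantization scale for the current step comes
# from a history of recent amax values, so it is known before the tensor exists.
import torch
from collections import deque

class DelayedScaler:
    def __init__(self, q_max: float = 448.0, history_len: int = 16, margin: float = 1.0):
        self.q_max = q_max
        self.margin = margin                      # extra headroom factor
        self.amax_history = deque(maxlen=history_len)
        # Scale starts at 1.0; in practice a calibration step seeds the history.
        self.scale = torch.tensor(1.0)

    def quantize(self, x: torch.Tensor):
        used_scale = self.scale                   # scale from previous steps' history
        x_fp8 = (x / used_scale).clamp(-self.q_max, self.q_max).to(torch.float8_e4m3fn)
        # Record this step's amax and refresh the scale for the next step.
        self.amax_history.append(x.abs().max().detach())
        amax = torch.stack(list(self.amax_history)).max()
        self.scale = (amax * self.margin / self.q_max).clamp(min=1e-12)
        return x_fp8, used_scale
```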
In conclusion, end-to-end FP8 training pipelines have matured into deployable, numerically stable systems for LLMs, MoE, RL, vision models, and federated learning. They deliver significant reductions in computation time, model footprint, and memory bandwidth, while maintaining downstream performance within 0.1–2% of traditional BF16 pipelines, provided appropriate mitigations for activation outliers and optimizer-state quantization are employed (Hernández-Cano et al., 26 May 2025, Fishman et al., 2024, Liang et al., 28 Nov 2025, Xi et al., 2024, Wang et al., 26 Sep 2025, Wang et al., 4 Nov 2025, Qiu et al., 26 Jan 2026, Xi et al., 20 Jan 2026, Wang et al., 2024).