Full-FP8 Training Frameworks
- Full-FP8 training frameworks are end-to-end methods that use 8-bit arithmetic for forward, backward, and optimizer updates with minimal high-precision exceptions.
- They employ specialized quantization, using FP8 formats like E4M3 and E5M2 along with techniques for outlier management and dynamic/static scaling to maintain numerical stability.
- These frameworks integrate fused operators and hardware-accelerated optimizations, achieving significant efficiency improvements and reduced memory footprints in large-scale models.
Full-FP8 training frameworks refer to end-to-end methodologies that perform the entirety of neural network training—forward, backward, and optimizer updates—using 8-bit floating point (FP8) arithmetic, with only selected exceptions for small high-precision “master” state. The recent emergence of robust FP8 support in accelerator hardware enables significant reductions in memory footprint and computational demands for large-scale language, vision, and multimodal models. This article surveys the technical landscape of full-FP8 training frameworks, focusing on quantization methods, outlier management, efficiency benchmarks, optimizer design, and system-level software aspects, while also cataloging current limitations and best practices.
1. FP8 Arithmetic Formats and Quantization Principles
Contemporary FP8 frameworks employ IEEE-like, saturating floating-point formats. The dominant FP8 variants are E4M3 (1 sign, 4 exponent, 3 mantissa bits, exponent bias 7, dynamic range ≃ [–448, +448]) and E5M2 (1 sign, 5 exponent, 2 mantissa bits, exponent bias 15, dynamic range ≃ [–57344, +57344]) (Wang et al., 26 Sep 2025, Peng et al., 2023, Fishman et al., 2024). For training, a typical convention is:
- E4M3 for weights and activations (forward path)
- E5M2 for gradients or optimizer second moments, due to their wider dynamic range
Quantization reduces an FP16/FP32 tensor $X$ to FP8 via a scale $s$, often per tensor, block, or group:

$$X_{\mathrm{FP8}} = \mathrm{cast}_{\mathrm{FP8}}(X \cdot s), \qquad s = \frac{q_{\max}}{\max|X|},$$

where $q_{\max}$ is the largest representable FP8 value (448 for E4M3, 57344 for E5M2). The scale is typically selected to maximize dynamic range utilization and prevent overflow. Quantization is performed via per-tensor (for large GEMMs), per-block, or mixed-granularity (hybrid) schemes (Wang et al., 26 Sep 2025, Xi et al., 2024).
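The per-tensor scheme above can be sketched as follows. This is a minimal NumPy illustration of the scaling logic only; a real kernel would additionally round the scaled values to the E4M3 grid, which is omitted here.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_per_tensor(x, fp8_max=E4M3_MAX):
    """Simulated per-tensor FP8 quantization: choose the scale so the
    tensor's absolute maximum maps to the largest representable value."""
    amax = np.abs(x).max()
    scale = fp8_max / max(amax, 1e-12)  # guard against all-zero tensors
    x_scaled = np.clip(x * scale, -fp8_max, fp8_max)
    # A real implementation would cast x_scaled to FP8 here; we keep it
    # in float to illustrate only the scale selection and clipping.
    return x_scaled, scale

def dequantize(x_fp8, scale):
    return x_fp8 / scale

x = np.array([0.01, -2.5, 100.0])
xq, s = quantize_per_tensor(x)
```

Because the scale is derived from the current absolute maximum, the largest element lands exactly at the FP8 saturation point and smaller elements use the remaining dynamic range.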
Hybrid-granularity approaches combine block-wise quantization for weights with token-wise or group-wise scaling for activations, absorbing local outliers without excessive quantization error (Wang et al., 26 Sep 2025, Xi et al., 2024).
Upward rounding of scales (e.g., UE8M0) clamps scale factors to the next power of two, ensuring noise is bounded and preventing underflow from outliers (Wang et al., 26 Sep 2025).
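The upward power-of-two rounding can be written in a few lines; this is an illustrative sketch of the idea (UE8M0 stores only an 8-bit exponent, so the scale must land on a power of two, and rounding toward the safe direction prevents overflow of the quantized values).

```python
import math

def round_scale_up_pow2(scale: float) -> float:
    """Round a (positive) quantization scale up to the next power of
    two, so it can be stored as an exponent-only (UE8M0-style) value."""
    assert scale > 0.0
    return 2.0 ** math.ceil(math.log2(scale))
```

Exact powers of two are left unchanged; everything else moves up to the next one, so the rounded scale never understates the original.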
2. Architecture and Dataflow Adjustments for FP8
To ensure stability and high throughput in full-FP8 workflows, model architectures are adjusted to control activation and gradient distribution. Common interventions include:
- Activation Function Selection: Outlier-amplifying activations (e.g., SwiGLU) are replaced or stabilized (e.g., via Smooth-SwiGLU (Fishman et al., 2024) or with GeLU/xIELU (Hernández-Cano et al., 26 May 2025)) to avoid rare extreme values that would saturate FP8.
- Normalization Placement: Post-residual LayerNorm or RMSNorm (Res-Post-LN) protects the dynamic range at critical boundaries (Wang et al., 26 Sep 2025, Narayan et al., 9 Feb 2025).
- Outlier Suppression: Techniques such as the TWEO loss penalize activations whose magnitude exceeds a fixed threshold $\tau$, adding a penalty term on the excess over that threshold, which reduces catastrophic overflows (Liang et al., 28 Nov 2025).
- Fused and Consistent FP8 Dataflows: MoE frameworks (FP8-Flow-MoE) eliminate redundant quantize-dequantize boundaries by exploiting scaling-aware transpositions, reducing double quantization error (Wang et al., 4 Nov 2025). Fused operators (e.g., combining nonlinearity and quantization steps) further minimize casting overhead.
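A threshold penalty of the kind used for outlier suppression can be sketched as below. This illustrates only the general shape of such a regularizer, not the exact TWEO formulation; the weight `lam` and threshold `tau` are hypothetical hyperparameters.

```python
import numpy as np

def outlier_penalty(activations, tau=100.0, lam=1e-4):
    """Illustrative outlier-suppression penalty: penalize the squared
    amount by which |a| exceeds a fixed threshold tau. Activations
    inside [-tau, tau] contribute nothing to the loss."""
    excess = np.maximum(np.abs(activations) - tau, 0.0)
    return lam * np.mean(excess ** 2)
```

Added to the training loss, a term like this discourages the rare extreme activations that would otherwise saturate the FP8 range.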
Delayed or static scaling, where quantization scales are updated only every $k$ steps or derived from past maxima, helps avoid repeated costly amax sweeps and underflow/overflow events (Hernández-Cano et al., 26 May 2025, Fishman et al., 2024).
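Delayed scaling can be sketched with a small amax history, in the spirit of (but not identical to) the delayed-scaling recipes in vendor libraries; the history length is an assumed hyperparameter.

```python
from collections import deque
import numpy as np

class DelayedScaler:
    """Delayed scaling: the FP8 scale is derived from a short history
    of past absolute maxima rather than the current tensor, keeping the
    costly amax reduction off the critical path of every step."""

    def __init__(self, fp8_max=448.0, history_len=16):
        self.fp8_max = fp8_max
        self.amax_history = deque(maxlen=history_len)
        self.scale = 1.0

    def update(self, tensor):
        self.amax_history.append(float(np.abs(tensor).max()))
        # Use the largest amax seen recently, to be robust to spikes.
        self.scale = self.fp8_max / max(max(self.amax_history), 1e-12)
        return self.scale
```

Using the maximum over the history (rather than the latest value alone) trades a little dynamic range for robustness against transient activation spikes.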
3. Optimizer and State Quantization Strategies
While forward and backward passes operate in FP8, optimizer states (Adam first and second moments, momentum buffers) are traditionally held in higher precision to avoid update underflow. Several frameworks now quantize these states:
- Dynamic Range Expansion: COAT applies an adaptive nonlinear “companding” transformation to momentum and second-moment tensors so that their dynamic range matches FP8 capabilities. Each group is transformed as $f(x) = \operatorname{sign}(x)\,|x|^{k}$, storing the per-group exponent $k$ and inverting with $f^{-1}(y) = \operatorname{sign}(y)\,|y|^{1/k}$ (Xi et al., 2024).
- FP8-Adam: Both the first-moment ($m$) and second-moment ($v$) states can be quantized (e.g., $m$ to E4M3, $v$ to E5M2), with updates performed in FP32 and immediately requantized: $m_t = Q_{\mathrm{E4M3}}(\beta_1 m_{t-1} + (1-\beta_1) g_t)$, $v_t = Q_{\mathrm{E5M2}}(\beta_2 v_{t-1} + (1-\beta_2) g_t^2)$ (Fishman et al., 2024).
- Mixed Precision Per Variable: FP8-LM stores gradients and first moments in FP8, second moments in FP16 for increased robustness (Peng et al., 2023).
Maintaining master FP32 or BF16 weights remains common to preserve the fidelity of small-magnitude updates (Wang et al., 26 Sep 2025, Peng et al., 2023).
Optimizer step hyperparameters are unmodified in most frameworks, since the quantization-induced noise is either negligible or counteracted by stochasticity in large-batch SGD/Adam (Wang et al., 26 Sep 2025, Narayan et al., 9 Feb 2025, Peng et al., 2023).
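The update-then-requantize pattern described above can be sketched as follows. The FP8 cast is emulated by a per-tensor scale and clip (real code would also round to the FP8 grid), and `w` stands in for a high-precision master weight; this is an illustrative sketch, not any framework's exact implementation.

```python
import numpy as np

E4M3_MAX, E5M2_MAX = 448.0, 57344.0

def fake_quant(x, fp8_max):
    """Simulated FP8 round-trip via per-tensor scale + clip."""
    amax = max(np.abs(x).max(), 1e-12)
    scale = fp8_max / amax
    return np.clip(x * scale, -fp8_max, fp8_max) / scale

def adam_step_fp8(w, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step in the style described above: the moment updates
    run in FP32, then are immediately requantized -- the first moment
    to the E4M3 range, the wider-range second moment to E5M2."""
    m = fake_quant(b1 * m + (1 - b1) * g, E4M3_MAX)
    v = fake_quant(b2 * v + (1 - b2) * g * g, E5M2_MAX)
    w = w - lr * m / (np.sqrt(v) + eps)
    return w, m, v
```

Because $v$ accumulates squared gradients, its values span a much wider dynamic range than $m$, which is why the wider-exponent E5M2 format is the natural choice for it.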
4. Empirical Results and Efficiency Gains
State-of-the-art full-FP8 training frameworks consistently demonstrate substantial improvements across major efficiency metrics, while essentially matching BF16 or FP16 model quality:
| Framework / Model | Time Reduction | Memory Savings | Throughput Increase | Metric Equivalence |
|---|---|---|---|---|
| InfiR2-1.5B, 7B (Wang et al., 26 Sep 2025) | Up to 22% | Up to 14% | Up to 19% | Within 1–2 pts BF16 on reasoning benchmarks |
| COAT (Llama-7B) (Xi et al., 2024) | Up to 43% | Up to 54% | Up to 57% | Perplexity, downstream tasks within ±0.1% BF16 |
| FP8-Flow-MoE (671B) (Wang et al., 4 Nov 2025) | Up to 21% | 16.5 GB/GPU | Stable at scale | No loss vs. BF16, OOM avoided |
| FP8-LM (175B) (Peng et al., 2023) | 75% | 39% | 1.5–2x | No loss vs. BF16, 0% win-rate difference |
| μnit scaling (Narayan et al., 9 Feb 2025) | Up to 33% | N/A | 1.33x | Parity or improvement over BF16 |
| FALQON (7B LoRA) (Choi et al., 28 Oct 2025) | 3x (LoRA case) | Halved memory | N/A | Within 1–2% of baseline |
Notably, outlier management (TWEO, Smooth-SwiGLU, QK-regularization) is crucial for stability in extended, multi-hundred-billion token training runs or MoE/LoRA regimes (Liang et al., 28 Nov 2025, Fishman et al., 2024, Hernández-Cano et al., 26 May 2025, Choi et al., 28 Oct 2025). With these in place, loss curves and downstream evaluation metrics remain indistinguishable from FP16/BF16 baselines.
5. Software and Hardware Integration
Deployment of full-FP8 training stacks leverages highly-optimized vendor libraries and custom extensions:
- Kernel and GEMM Support: Native FP8 GEMMs with support for blockwise scaling, fused quantization-dequantization, and static scaling factors are available in NVIDIA Transformer Engine, DeepGEMM (Hopper/Blackwell), and Triton-based kernels (Wang et al., 26 Sep 2025, Hernández-Cano et al., 26 May 2025, Xi et al., 2024, Peng et al., 2023).
- Framework Plug-ins: Open-source toolkits such as InfiR2, FP8-LM (MS-AMP), COAT, FP8-Flow-MoE provide PyTorch/Megatron-LM/TransformerEngine/FSDP integration, enabling operator and optimizer replacement via context managers, module hooks, or launcher flags (Wang et al., 26 Sep 2025, Peng et al., 2023, Xi et al., 2024, Wang et al., 4 Nov 2025).
- Distributed Training: Communication overhead is mitigated by FP8 all-reduce and parameter exchange, yielding up to 2× lower inter-GPU bandwidth (Wang et al., 26 Sep 2025, Peng et al., 2023).
- Checkpointing: Only minimal master state is stored in BF16/FP32 for checkpointing and logging; all large tensors remain in FP8 throughout the pipeline.
Practical recommendations include fusing quantization into preceding compute kernels, recomputing static scales at intervals, and always monitoring for potential overflows/underflows via runtime diagnostics (Hernández-Cano et al., 26 May 2025, Xi et al., 2024).
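The overflow/underflow monitoring recommended above can be implemented as a cheap runtime diagnostic; this sketch reports, for a given tensor and scale, the fraction of elements that would saturate or flush to zero after casting (the E4M3 subnormal bound used here is $2^{-9}$).

```python
import numpy as np

def fp8_saturation_stats(x, fp8_max=448.0, scale=1.0):
    """Runtime diagnostic: fractions of elements that would saturate
    (overflow) or flush toward zero (underflow) at the current scale."""
    xs = np.abs(x * scale)
    overflow = float(np.mean(xs >= fp8_max))
    # Smallest positive E4M3 subnormal is 2**-9; nonzero magnitudes
    # below it lose essentially all precision.
    underflow = float(np.mean((xs > 0) & (xs < 2.0 ** -9)))
    return overflow, underflow
```

Logging these fractions per layer makes a drifting activation distribution visible long before it corrupts the loss curve.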
6. Specializations: Mixture-of-Experts, Reinforcement Learning, and Fine-Tuning
- Mixture-of-Experts (MoE): FP8-Flow-MoE achieves casting-free expert paths using scaling-aware transposes and fused kernels, reducing per-layer quantization boundaries from 12 to 2 and maintaining lossless convergence at 671B scale (Wang et al., 4 Nov 2025).
- Reinforcement Learning: Jet-RL demonstrates that robust FP8 RL optimization requires a strictly on-policy precision flow; hybrid BF16 training with FP8 rollouts introduces critical instability under long-horizon rollouts, an issue resolved only when training and rollout share a unified FP8 quantization flow (Xi et al., 20 Jan 2026).
- Low-Rank Adaptation (LoRA): FALQON merges LoRA adapters directly into the FP8 backbone at initialization, avoiding the overheads of small-matrix quantization and enabling a single quantization pass per forward/backward, producing 3× speedup in LoRA fine-tuning for LLMs (Choi et al., 28 Oct 2025).
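The adapter merge underlying the LoRA specialization can be sketched in one line; this shows only the standard LoRA merge algebra ($W + \frac{\alpha}{r} BA$), performed before quantization so the forward pass needs a single quantized matmul. FALQON's exact procedure is described in the paper; the shapes and names here are illustrative.

```python
import numpy as np

def merge_lora_into_backbone(W, A, B, alpha, r):
    """Merge a LoRA adapter into the backbone weight before quantizing:
    W (d_out x d_in), A (r x d_in), B (d_out x r). With the usual
    zero-initialized B, the merge at initialization is a no-op."""
    return W + (alpha / r) * (B @ A)
```

Merging avoids quantizing the small adapter matrices separately, whose tiny GEMMs would otherwise dominate quantization overhead relative to useful compute.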
7. Limitations, Stability Hazards, and Best Practices
Despite robustness advances, several points require consideration:
- Activation Outliers: Emergence of extreme outliers (e.g., due to SwiGLU alignment) can destabilize FP8 training unless mitigated through tailored architectural or loss interventions (Fishman et al., 2024, Liang et al., 28 Nov 2025, Hernández-Cano et al., 26 May 2025).
- Optimizer Quantization: Full FP8-quantized optimizers require nontrivial adaptive scaling (companding, E4M3/E5M2 split), and moment underflow can degrade adaptation if not handled with care (Xi et al., 2024, Fishman et al., 2024).
- Dynamic Versus Static Scaling: Dynamic per-tensor scaling improves robustness but can incur kernel/data movement overhead; static scaling (μnit, unit) offers maximum speed and direct hyperparameter transfer but is less robust in settings with drifting activation distributions (Narayan et al., 9 Feb 2025, Blake et al., 2023).
- Hardware Requirements: Realizing theoretical speedups and memory savings necessitates native FP8 GEMM, accumulation, and reduced-precision memory bandwidth—only current-generation accelerators (NVIDIA H100/Blackwell, Intel Gaudi2) provide full support (Wang et al., 26 Sep 2025, Fishman et al., 2024, Zhang et al., 2023).
- Distributed/Communication Protocols: Accurate, low-overhead partitioning of FP8 tensors and their scale factors is critical for ZeRO, FSDP, and large-scale tensor/sequence parallelism (Peng et al., 2023, Xi et al., 2024).
Common best practices include maintaining high-precision master weights, performing activation outlier monitoring, employing upward-rounded scales for safety, and ensuring strict precision matching between training and rollout in RL settings (Wang et al., 26 Sep 2025, Xi et al., 2024, Xi et al., 20 Jan 2026).
References:
- InfiR2: "InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced LLMs" (Wang et al., 26 Sep 2025)
- TWEO: "TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies" (Liang et al., 28 Nov 2025)
- Jet-RL: "Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow" (Xi et al., 20 Jan 2026)
- FP8-Flow-MoE: "FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error" (Wang et al., 4 Nov 2025)
- COAT: "COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training" (Xi et al., 2024)
- μnit Scaling: "μnit Scaling: Simple and Scalable FP8 LLM Training" (Narayan et al., 9 Feb 2025)
- Scaling FP8 to trillion tokens: "Scaling FP8 training to trillion-token LLMs" (Fishman et al., 2024)
- FP8-LM: "FP8-LM: Training FP8 LLMs" (Peng et al., 2023)
- FALQON: "FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic" (Choi et al., 28 Oct 2025)
- Unit Scaling: "Unit Scaling: Out-of-the-Box Low-Precision Training" (Blake et al., 2023)
- Flexible FP8: "Exploring the Potential of Flexible 8-bit Format: Design and Algorithm" (Zhang et al., 2023)