FP8 GEMM LLM Training
- FP8 GEMM LLM Training is a technique that uses 8-bit floating point operations to enhance throughput and resource efficiency in large language model training.
- It employs tailored quantization strategies and compiler/kernel optimizations to manage reduced dynamic range and mitigate outlier activation challenges.
- The approach integrates mixed-precision schemes and stability monitoring to ensure robust performance and improved memory efficiency on modern hardware.
General Matrix Multiplication (GEMM) in 8-bit floating point (FP8) precision has become pivotal in the pursuit of high-throughput, resource-efficient training of LLMs. The transition to FP8 is motivated both by theoretical advantages in compute and bandwidth efficiency and by recent hardware support for 8-bit floating-point arithmetic. However, the reduced dynamic range, limited mantissa bits, and sensitivity to outlier activations introduce new optimization, algorithmic, and stability challenges. The state of the art addresses these with a combination of quantization strategies, architectural modifications, stability monitoring, and compiler- or kernel-level optimizations, enabling, for the first time, end-to-end FP8 GEMM LLM training at scale.
1. FP8 Numeric Formats and Quantization Strategies
FP8 is not a monolithic number system but a family of floating-point representations, commonly parameterized by the number of exponent and mantissa bits. The principal formats in LLM practice are E4M3 (1 sign, 4 exponent, 3 mantissa bits) and E5M2 (1 sign, 5 exponent, 2 mantissa bits). The choice between them reflects a fundamental trade-off: E5M2 offers a broader dynamic range but less precision; E4M3 provides denser grid points but a narrower representable range. The correct balance depends on the statistical distribution of weights and activations:
| FP8 Type | Exponent Bits | Mantissa Bits | Dynamic Range | Max Representable Value |
|---|---|---|---|---|
| E5M2 | 5 | 2 | Widest | ~57,344 |
| E4M3 | 4 | 3 | Intermediate | ~448 |
| E3M4 | 3 | 4 | Narrowest | ~30 |
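The maximum representable values in the table follow directly from the bit layouts. The short sketch below reproduces them, assuming OCP-style special-value conventions for E4M3/E5M2 (E3M4 conventions vary across implementations):

```python
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a 1-sign/exp_bits/man_bits format.
    ieee_like=True reserves the top exponent for inf/NaN (as in E5M2);
    ieee_like=False only sacrifices the all-ones mantissa at the top exponent (E4M3)."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = (2 ** man_bits - 1) / 2 ** man_bits
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = (2 ** man_bits - 2) / 2 ** man_bits
    return (1 + max_frac) * 2 ** max_exp

print(fp8_max(5, 2, ieee_like=True))    # E5M2 -> 57344.0
print(fp8_max(4, 3, ieee_like=False))   # E4M3 -> 448.0
print(fp8_max(3, 4, ieee_like=False))   # E3M4 -> 30.0 (convention-dependent)
```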
LLMs are characterized by heavy-tailed, outlier-prone activations, particularly post-normalization. Empirical benchmarks confirm that FP8 outperforms INT8 quantization in both workload coverage and accuracy for LLMs and other tasks, achieving, for instance, 92.64% coverage with E4M3 compared to 65.87% for INT8 (Shen et al., 2023).
Quantization involves dynamic or static scaling factors, often calculated per-tensor, per-channel, or per-token. Static scaling offers throughput advantages but is more susceptible to outlier-induced error; dynamic scaling (including just-in-time and delayed scaling) tracks evolving distributional shifts more closely during training or inference, and is especially protective in attention mechanisms and in training phases where kurtosis may spike (Hernández-Cano et al., 26 May 2025, Fishman et al., 19 Sep 2024). Hybrid strategies, such as block-wise scaling for weights and finer granularity for activations, can be used to maximize numerical fidelity and hardware alignment (Wang et al., 26 Sep 2025).
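The interplay of scale granularity and amax-based dynamic scaling can be illustrated with a small numpy simulation. This is a sketch only: the `round_to_3bit_mantissa` helper is an illustrative stand-in that ignores subnormals and exponent saturation, whereas real kernels cast to a hardware FP8 dtype.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_3bit_mantissa(x):
    # Crude stand-in for an E4M3 cast: keep ~3 explicit mantissa bits,
    # ignoring subnormals and exponent saturation.
    m, e = np.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_fp8_sim(x, axis=None):
    """Simulated FP8 quantization with dynamic (just-in-time) scaling.
    axis=None gives a per-tensor scale; axis=-1 a per-token/per-channel scale."""
    amax = np.max(np.abs(x), axis=axis, keepdims=axis is not None)
    scale = E4M3_MAX / np.maximum(amax, 1e-12)        # map observed range onto FP8 range
    x_q = round_to_3bit_mantissa(np.clip(x * scale, -E4M3_MAX, E4M3_MAX))
    return x_q, scale                                  # store/communicate (x_q, scale)

acts = np.random.randn(4, 8).astype(np.float32) * 10.0
x_q, s = quantize_fp8_sim(acts, axis=-1)               # finer-grained, per-token scaling
print(np.max(np.abs(acts - x_q / s)))                  # small dequantization error
```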
Stochastic rounding, which randomly selects between two nearest grid points with probability proportional to proximity, provides minimal empirical benefit over nearest rounding in most settings (Kim et al., 3 Feb 2025).
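For reference, a minimal integer-grid version of the rounding rule (the FP8 case applies the same idea on the mantissa grid):

```python
import numpy as np

def stochastic_round(x, rng=np.random.default_rng(0)):
    """Round to the integer grid, choosing floor/ceil with probability
    proportional to proximity, so E[stochastic_round(x)] == x."""
    lo = np.floor(x)
    frac = x - lo                           # distance to the lower grid point
    return lo + (rng.random(x.shape) < frac)

x = np.full(100_000, 0.3)
print(stochastic_round(x).mean())           # ~0.3: unbiased, unlike round-to-nearest (0.0)
```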
2. Compiler-Level and Kernel Optimization for FP8 GEMM
Transitioning GEMM to FP8 requires both algorithmic and low-level software innovation to realize its theoretical throughput benefits. Compiler-level optimization, as implemented in TVM, leverages search-based methods (e.g., G-BFS, N-A2C) to identify tile sizes and kernel configurations that minimize execution time under memory and hardware constraints (Zhang et al., 2019). These methods reduce computation time by 24–40% versus prior learned-tuner baselines while exploring only 0.1% of the configuration space.
At the kernel level, FP8 GEMM kernels exploit modern accelerator tensor cores capable of natively computing FP8 multiplications. In mixed-precision architectures (e.g., RedMulE), internal accumulation occurs in FP16 to limit accumulated quantization error, with casting units converting FP8 to FP16 (and vice versa) at the input/output boundaries (Tortorella et al., 2023). These designs improve utilization, double effective memory bandwidth relative to FP16, and drive energy efficiency into the range of 1.2–1.7 TFLOPS/W.
Padding requirements in grouped GEMM for Mixture-of-Experts architectures traditionally incur substantial memory and computational overhead. The TMA-Adaptive FP8 Grouped GEMM approach eliminates the need for padding by provisioning a pool of TMA descriptors indexed to residual group sizes and employing dual-phase load-store operations to maintain bitwise numerical equivalence and full memory alignment (Su et al., 7 Aug 2025). This method improves throughput by 1.7–20.4% and reduces peak memory by up to 23.8%.
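A back-of-the-envelope numpy sketch of the padding overhead that conventional grouped GEMM incurs (tile size and token distribution are illustrative assumptions, not figures from the cited work):

```python
import numpy as np

def padding_overhead(group_sizes, tile_m=128):
    """Extra rows introduced when each expert's token count is padded up to a
    tile multiple, as conventional grouped-GEMM kernels require."""
    padded = [((g + tile_m - 1) // tile_m) * tile_m for g in group_sizes]
    return sum(padded) - sum(group_sizes), sum(padded)

rng = np.random.default_rng(0)
tokens_per_expert = rng.multinomial(16_384, [1 / 64] * 64)   # 16k tokens over 64 experts
extra, total = padding_overhead(tokens_per_expert)
print(f"padded rows: {extra} of {total} ({100 * extra / total:.1f}% wasted)")
```

The TMA-adaptive approach removes exactly this wasted fraction by matching descriptors to the residual group sizes instead of rounding every group up.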
3. Architectures, Stability, and Monitoring for Robust FP8 GEMM Training
End-to-end FP8 training at scale has historically been hampered by occasional catastrophic divergence, especially in the presence of outlier-amplifying operations (e.g., SwiGLU activations). Extended training reveals that quadratic growth in SwiGLU outputs, driven by weight-vector alignment and amplified by regularization, can cause activation spikes that breach FP8's limited dynamic range (Fishman et al., 19 Sep 2024).
Stabilization strategies include:
- Architecture Modifications: FOG (Fast and Outlier-Guarded) architectures systematically remove pre-normalization blocks, freeze or regularize QK RMSNorm gains, apply extra normalization using RMSNorm/tanh in attention, scale inputs to the transformer block for unit variance, and introduce post-normalization (e.g., LayerScale) before residuals (Hernández-Cano et al., 26 May 2025).
- Smooth-SwiGLU: Per-channel scaling is introduced before quantization and then undone post-quantization, reducing the risk of activation spikes without altering function semantics (Fishman et al., 19 Sep 2024).
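A minimal numpy sketch of the scale-then-unscale idea behind Smooth-SwiGLU. The scale choice (per-channel max) and its placement here are illustrative assumptions; the actual method applies per-channel scaling to the tensor that is cast to FP8 ahead of the down-projection and folds the inverse scale into the subsequent computation.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def swiglu(x, w_gate, w_up):
    return silu(x @ w_gate) * (x @ w_up)              # outlier-prone product

def smooth_swiglu_inputs(h, w_down):
    """Per-channel smoothing before quantization of the down-projection input.
    The inverse scale is folded into w_down, so the math is unchanged; only the
    tensor that would actually be cast to FP8 is tamed."""
    s = np.maximum(np.abs(h).max(axis=0), 1e-6)       # per-channel magnitude (illustrative)
    h_smoothed = h / s                                # this is what would be quantized
    w_down_folded = w_down * s[:, None]               # inverse scale folded into the weight
    return h_smoothed, w_down_folded

rng = np.random.default_rng(0)
x, w_gate, w_up, w_down = (rng.standard_normal(shape)
                           for shape in [(4, 16), (16, 32), (16, 32), (32, 16)])
h = swiglu(x, w_gate, w_up)
h_s, w_d = smooth_swiglu_inputs(h, w_down)
print(np.allclose(h @ w_down, h_s @ w_d))             # True: function semantics preserved
```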
Kurtosis monitoring is used as an early-warning metric: increases in the average kurtosis of QKV or block outputs precede observable loss or gradient-norm explosions by substantial token intervals (Hernández-Cano et al., 26 May 2025). Loss landscape sharpness metrics likewise quantify the risk of divergence before global instability manifests (Lee et al., 29 May 2024).
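A minimal way to compute such a monitoring signal is the excess kurtosis of flattened activations; which tensors to track and what thresholds to alarm on are design choices from the cited work, and the injected-outlier example below is purely illustrative.

```python
import numpy as np

def excess_kurtosis(t):
    """Excess kurtosis of a tensor's entries; an upward drift in QKV or block
    outputs serves as an early warning of FP8 instability."""
    t = t.reshape(-1).astype(np.float64)
    z = (t - t.mean()) / (t.std() + 1e-12)
    return (z ** 4).mean() - 3.0                # ~0 for Gaussian activations

rng = np.random.default_rng(0)
healthy = rng.standard_normal(100_000)
spiky = np.concatenate([healthy, rng.standard_normal(200) * 40.0])  # injected outliers
print(excess_kurtosis(healthy), excess_kurtosis(spiky))             # ~0 vs. large
```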
4. Full-Pipeline and Mixed-Precision FP8 Training
Modern FP8 training frameworks target not just GEMMs but also optimizer states, activation storage, and distributed communication. FP8-LM introduces an incremental, three-tiered approach:
- Gradients communicated among GPUs are quantized into FP8 after per-tensor scaling and aggregation, reducing communication bandwidth and volume (Peng et al., 2023).
- Optimizer states are “precision-decoupled”, storing first moments in FP8 and second moments in FP16; master weights are typically retained as FP16, as weight updates are most susceptible to rounding error.
- Distributed parallelism (tensor, pipeline, and sequence) and ZeRO-sharded communication are adapted to FP8 tensors plus their scale factors, further streamlining memory use and scaling efficiency.
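A rough numpy sketch of the precision-decoupling idea follows. The function names are illustrative, and storage dtypes are simulated via per-tensor scaling plus mantissa rounding because numpy has no FP8 type; a real implementation would keep the quantized state tensor together with its scale instead of dequantizing immediately.

```python
import numpy as np

def to_low_precision(x, max_val, mantissa_bits):
    """Simulate low-precision storage: per-tensor scale plus mantissa rounding
    (a stand-in for a real FP8/FP16 cast; conventions are illustrative)."""
    scale = max_val / np.maximum(np.abs(x).max(), 1e-12)
    m, e = np.frexp(np.clip(x * scale, -max_val, max_val))
    q = np.ldexp(np.round(m * 2 ** (mantissa_bits + 1)) / 2 ** (mantissa_bits + 1), e)
    return q / scale   # dequantized on the spot; a real kernel stores (q, scale)

def decoupled_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    """One Adam step with precision-decoupled state storage: first moment kept
    FP8-like (3 mantissa bits), second moment FP16-like (10 bits); master
    weights stay in higher precision, mirroring the FP8-LM recipe."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    w = w - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    m = to_low_precision(m, max_val=448.0, mantissa_bits=3)       # FP8 E4M3-like
    v = to_low_precision(v, max_val=65504.0, mantissa_bits=10)    # FP16-like
    return w, m, v

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 11):
    w, m, v = decoupled_adam_step(w, rng.standard_normal(1024) * 0.1, m, v, t)
```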
Compression frameworks such as COAT further quantize optimizer states and activations into FP8 by using dynamic range expansion (exponentiation per group to match the distribution’s dynamic range to FP8’s) and mixed-granularity quantization (per-group for nonlinear activations, per-tensor for linear layers), thereby reducing end-to-end training memory by 1.54× and delivering training speedups of 1.43× relative to BF16 (Xi et al., 25 Oct 2024).
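As a hedged sketch of the general dynamic-range-expansion idea (an illustrative reconstruction, not COAT's exact recipe), a per-group power function can be chosen so that the group's min/max ratio matches the FP8 format's representable range:

```python
import numpy as np

def expand_dynamic_range(x, fp8_max=448.0, fp8_min_normal=2.0 ** -6):
    """Raise magnitudes to a power k chosen so the group's dynamic range matches
    FP8's, then rescale into the representable interval. Invert with
    (|z| / scale) ** (1 / k) * sign(z)."""
    a = np.abs(x) + 1e-20
    amin, amax = a.min(), a.max()
    k = np.log(fp8_max / fp8_min_normal) / np.log(amax / amin)   # match dynamic ranges
    y = np.sign(x) * a ** k
    scale = fp8_max / np.abs(y).max()
    return y * scale, k, scale

expanded, k, scale = expand_dynamic_range(np.random.default_rng(0).standard_normal(256) * 1e-3)
print(k, np.abs(expanded).max())   # k > 1 expands a narrow group onto the FP8 range
```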
5. Algorithmic and Mathematical Principles
FP8 quantization exploits the floating-point number construction

$$x = (-1)^{s}\, 2^{\,e-b}\left(1 + \frac{m}{2^{M}}\right),$$

with $s$ the sign, $m$ the value of the $M$ mantissa bits, $e$ the exponent, and $b$ its bias. The scaling factor is set dynamically, e.g. as $s_{\text{tensor}} = \mathrm{FP8\_MAX} / \max |x|$, for each tensor or channel. To control quantization error under block-wise or token-wise schemes, rounding the scale to the nearest power of 2 (as in UE8M0 scaling) further minimizes quantization noise (Wang et al., 26 Sep 2025).
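A small sketch of power-of-two scale selection (the UE8M0 idea); the exact rounding direction used in practice is an implementation choice:

```python
import numpy as np

def pow2_scale(amax, fp8_max=448.0):
    """Snap the amax-derived scale to a power of two, so multiplying and
    dividing by it is exact in binary arithmetic. Flooring keeps the scaled
    values within fp8_max; nearest rounding is an alternative."""
    raw = fp8_max / np.maximum(amax, 1e-12)
    return 2.0 ** np.floor(np.log2(raw))

print(pow2_scale(13.7))   # raw scale ~32.7 -> 32.0
```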
IM-Unpack addresses outlier "heavy-hitters" by decomposing large values into a sum of base-$b$ components,

$$X = \sum_{i} b^{\,i} X_i, \qquad X Y = \sum_{i,j} b^{\,i+j}\, X_i Y_j,$$

enabling all GEMM operations to remain in low precision while reconstructing the exact product. Compositions of row-wise, column-wise, or combined unpacking keep all GEMM entries within FP8 bounds, with minimal overhead (Zeng et al., 12 Mar 2024).
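A toy numpy check of the decomposition identity (nonnegative integer matrices for simplicity; IM-Unpack's actual unpacking operates row- and column-wise and targets heavy-hitter entries only):

```python
import numpy as np

def unpack_base(X, b=16):
    """Decompose an integer matrix into base-b digits: X = sum_i X_i * b**i,
    with every digit in [0, b) so it fits a low-precision format."""
    digits, R = [], X.copy()
    while np.any(R != 0):
        digits.append(R % b)
        R //= b
    return digits

rng = np.random.default_rng(0)
X = rng.integers(0, 5000, size=(4, 6))
Y = rng.integers(0, 5000, size=(6, 3))
b = 16
Xd, Yd = unpack_base(X, b), unpack_base(Y, b)
# Reconstruct the exact product from small-magnitude GEMMs only.
prod = sum(b ** (i + j) * (Xi @ Yj)
           for i, Xi in enumerate(Xd) for j, Yj in enumerate(Yd))
print(np.array_equal(prod, X @ Y))   # True
```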
Mixed-precision GEMM kernels (e.g., RedMulE, FireQ) operate by loading FP8 or INT4 encoded weights into register/block-local FP16 or FP32 units, performing accumulation at higher precision, and writing back results in FP8 after applying scaling and lossless epilogue reduction steps (Tortorella et al., 2023, 2505.20839).
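A numpy sketch of this accumulate-wide, downcast-late pattern (FP16 stands in for the FP8 operand type since numpy has no FP8 dtype; the function name and scale handling are illustrative):

```python
import numpy as np

def gemm_lowprec_inputs_highprec_acc(A8, B8, scale_a, scale_b):
    """Low-precision operands, higher-precision accumulation, and an epilogue
    that rescales before casting the result back down."""
    acc = A8.astype(np.float32) @ B8.astype(np.float32)     # accumulate wide
    return (acc / (scale_a * scale_b)).astype(np.float16)   # rescale, downcast

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)).astype(np.float32)
B = rng.standard_normal((128, 32)).astype(np.float32)
sa, sb = 448.0 / np.abs(A).max(), 448.0 / np.abs(B).max()
A8, B8 = (A * sa).astype(np.float16), (B * sb).astype(np.float16)  # FP16 as FP8 stand-in
C = gemm_lowprec_inputs_highprec_acc(A8, B8, sa, sb)
print(np.abs(C.astype(np.float32) - A @ B).max())   # small error vs. full-precision GEMM
```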
6. Application Benchmarks, Hardware Support, and Cost Analysis
Experimental evidence from production-scale LLMs demonstrates throughput and efficiency gains:
- Up to 75% faster training in FP8 versus BF16 (Megatron-LM baseline) on H100 GPUs, a 39% reduction in real memory usage for 175B-parameter models, and a 37% training speedup over the Nvidia Transformer Engine (Peng et al., 2023).
- In TinyML settings, RedMulE's FP8 configuration achieves 1.67 TFLOPS/W and 117 GFLOPS at 613 MHz with 99.4% compute-element (CE) utilization (Tortorella et al., 2023).
- TMA-Adaptive FP8 Grouped GEMM eliminates up to 23.8% memory and up to 20.4% compute overhead by obviating group padding (Su et al., 7 Aug 2025).
Hardware platforms such as the NVIDIA H100 and Intel Gaudi 2 offer native support for FP8 arithmetic with high bandwidth and specialized scaling or accumulator paths (FP32 accumulation on Gaudi 2). FP8 enables greater token throughput per watt, with observed 1.5–2× speedups in thin-GEMM-dominated decode phases across LLM training and inference workloads. Power and cooling implications directly affect total cost of ownership, and shifting workload characteristics (e.g., thin GEMMs) are emerging as key constraints for future accelerator hardware design (Kim et al., 3 Feb 2025).
7. Limitations, Trade-offs, and Future Directions
Despite quantifiable gains, FP8 training introduces new instabilities:
- Reduced bit precision severely limits training robustness across seeds, hyperparameters, and datasets. Large models (e.g., 175B parameters) experience more frequent loss divergence and sharpness spikes as the learning rate is increased, particularly in FP8 or in simulated reduced-mantissa variants such as E8M3/E8M4 (Lee et al., 29 May 2024).
- Downstream performance, particularly for code generation or mathematical reasoning, may suffer more in FP8 than in BF16, with increased frequency of instability and accuracy loss (Fujii et al., 10 Nov 2024).
- Ongoing research explores adaptive precision control, hybrid schemes, staged training, and improved loss landscape monitoring to blend FP8 speed with higher-precision stability when necessary (Lee et al., 29 May 2024).
Full-pipeline FP8 training recipes employing hybrid-granularity quantization strategies and outlier-aware scaling now reach lossless or near-lossless performance relative to BF16 on suites of reasoning benchmarks, with up to 22% reduction in training time, 14% peak memory savings, and 19% throughput improvement (Wang et al., 26 Sep 2025). Code releases and model checkpoints are facilitating broader adoption and transparency.
Future work will further refine per-layer precision assignment, optimize block granularity, advance kernel fusion for memory-constrained scenarios, and synchronize algorithmic developments with the next generation of FP8-enabled hardware.
References
- Compiler-level GEMM optimization: (Zhang et al., 2019)
- FP8 quantization trade-offs: (Kuzmin et al., 2022, Shen et al., 2023)
- Hardware and kernel innovations: (Tortorella et al., 2023, Su et al., 7 Aug 2025)
- Large-scale and full-pipeline FP8 LLM training: (Peng et al., 2023, Hernández-Cano et al., 26 May 2025, Xi et al., 25 Oct 2024, Fishman et al., 19 Sep 2024, Wang et al., 26 Sep 2025)
- Stability and loss landscape: (Lee et al., 29 May 2024, Fujii et al., 10 Nov 2024)
- TCO, quantization, and accelerator analysis: (Kim et al., 3 Feb 2025)
- FireQ INT4-FP8 mixed precision kernels: (2505.20839)
- IM-Unpack and resilient low-bit GEMM: (Zeng et al., 12 Mar 2024)
- NestedFP dual-precision inference/training: (Lee et al., 29 May 2025)
- μnit Scaling for hyperparameter transfer: (Narayan et al., 9 Feb 2025)