- The paper introduces FOG (Fast and Outlier-Guarded) architectures to enable fully FP8 GEMM computations within transformer blocks, including attention, for large language model training.
- This approach achieves up to 40% throughput improvements over BF16 and matches BF16's downstream performance across model scales without falling back to higher precision.
- The FOG architectures demonstrate robustness in long data regimes and set a precedent for exploring efficient, purely FP8 pre-training frameworks for LLMs.
Overview of FP8 GEMM LLM Training
The paper "Towards Fully FP8 GEMM LLM Training at Scale" presents a significant advancement in the optimization of LLMs using 8-bit floating point (FP8) precision for General Matrix Multiplications (GEMMs). The paper addresses the crucial challenge of training transformer-based LLMs at scale using FP8 data formats, which offer substantial throughput improvements while maintaining model performance comparable to higher precision formats like BF16.
Problem Statement and Motivation
Training LLMs consumes millions of GPU hours over massive datasets. Lower-precision number formats such as FP8 have been proposed to reduce this compute cost, but their adoption has been hampered by training instability, particularly at scale: FP8's narrow dynamic range makes it vulnerable to overflow and underflow, so previous methods have resorted to throughput-costly fine-grained scaling or to falling back to higher precision for sensitive components such as the attention mechanism.
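The toy snippet below (my own construction, not an experiment from the paper; assumes PyTorch 2.1+) shows the failure mode: a single outlier dominates the per-tensor scale, pushing the rest of the tensor toward the tiny end of E4M3's range, where relative precision degrades sharply.

```python
# Toy illustration of why activation outliers break coarse-grained FP8: one large
# value dominates the per-tensor scale, so much of the bulk lands in E4M3's
# subnormal range and loses much of its relative precision. Assumes PyTorch >= 2.1.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for E4M3

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Per-tensor scale into E4M3, cast to FP8, cast back, undo the scale."""
    scale = x.abs().max() / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn).to(torch.float32) * scale

torch.manual_seed(0)
bulk = torch.randn(8192) * 0.1          # well-behaved activations
spiky = bulk.clone()
spiky[0] = 2000.0                       # one synthetic emergent outlier

def rel_err(orig: torch.Tensor, recon: torch.Tensor) -> float:
    return ((recon - orig).abs() / orig.abs().clamp(min=1e-8)).mean().item()

print("mean relative error, no outlier:  ", rel_err(bulk, fp8_roundtrip(bulk)))
print("mean relative error, with outlier:", rel_err(spiky[1:], fp8_roundtrip(spiky)[1:]))
```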
Proposed Solution: FOG Architectures
The paper introduces FOG (Fast and Outlier-Guarded) architectures engineered so that every GEMM in the transformer block, including the attention GEMMs, runs in FP8 during both the forward and backward passes. This yields up to 40% throughput improvements while matching the downstream performance of BF16 training, with no fallback to higher precision for attention.
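The exact FOG block design is not reproduced here, but the sketch below illustrates what "fully FP8, including attention" means in practice: every matmul of a single-head attention layer, including Q·Kᵀ and scores·V, goes through an emulated per-tensor-scaled FP8 GEMM. The helper, the per-tensor scaling granularity, and the single-head simplification are my assumptions rather than the paper's recipe; softmax is kept in FP32 in this sketch, and the causal mask is omitted for brevity.

```python
# Illustrative sketch (not FOG's exact design): single-head attention where every
# GEMM, including Q.K^T and scores.V, runs through emulated per-tensor-scaled FP8.
# Assumes PyTorch >= 2.1; the scaling granularity is an assumption, not the paper's.
import math
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def fp8_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Emulated FP8 GEMM: per-tensor scale, cast to E4M3, matmul after dequant."""
    sa, sb = a.abs().max() / FP8_MAX, b.abs().max() / FP8_MAX
    a8 = (a / sa).to(torch.float8_e4m3fn).to(torch.float32)
    b8 = (b / sb).to(torch.float8_e4m3fn).to(torch.float32)
    return (a8 @ b8) * (sa * sb)

def fp8_attention(x, wq, wk, wv, wo):
    q, k, v = fp8_mm(x, wq), fp8_mm(x, wk), fp8_mm(x, wv)   # projection GEMMs in FP8
    scores = fp8_mm(q, k.t()) / math.sqrt(q.shape[-1])      # Q.K^T GEMM in FP8
    probs = torch.softmax(scores, dim=-1)                   # softmax stays in FP32 here
    ctx = fp8_mm(probs, v)                                   # scores.V GEMM in FP8
    return fp8_mm(ctx, wo)                                   # output projection in FP8

seq, d = 128, 64
x = torch.randn(seq, d)
wq, wk, wv, wo = (torch.randn(d, d) * 0.05 for _ in range(4))
print(fp8_attention(x, wq, wk, wv, wo).shape)  # torch.Size([128, 64])
```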
Key innovations include:
- Reduction of Activation Outliers: The architecture design minimizes large outlier activations that are common in LLMs and problematic for FP8 precision. This fosters long-term stability in FP8 training.
- Monitoring and Early Prediction: The authors identify metrics, notably the kurtosis of activations, that track the health of low-precision training and flag potential instability well before divergence (see the sketch after this list).
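As a concrete illustration of the kurtosis signal (the metric is the paper's; the hook placement and aggregation below are my own assumptions), one can log the fourth standardized moment of each linear layer's input: values near 3 indicate roughly Gaussian activations, while a sustained rise flags the heavy-tailed outliers that FP8 handles poorly.

```python
# Minimal sketch of kurtosis-based monitoring; the aggregation and hook placement
# are assumptions, not necessarily the paper's exact protocol. Kurtosis of a
# Gaussian is ~3; a rising value signals heavier-tailed (outlier-prone) activations.
import torch

def kurtosis(x: torch.Tensor) -> float:
    """Fourth standardized moment E[(x - mu)^4] / E[(x - mu)^2]^2."""
    x = x.float().flatten()
    centered = x - x.mean()
    var = centered.pow(2).mean()
    return (centered.pow(4).mean() / var.pow(2).clamp(min=1e-12)).item()

def attach_kurtosis_hooks(model: torch.nn.Module, log: dict):
    """Record the activation kurtosis of every Linear layer's input on each forward pass."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.register_forward_hook(
                lambda m, inp, out, name=name: log.setdefault(name, []).append(kurtosis(inp[0]))
            )

# Usage: watch for layers whose kurtosis drifts far above the Gaussian baseline of ~3.
log: dict[str, list[float]] = {}
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64))
attach_kurtosis_hooks(model, log)
model(torch.randn(32, 64))
print({k: round(v[0], 2) for k, v in log.items()})
```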
Experimental Results
Extensive experiments demonstrate the viability of FOG architectures across model scales. At the 8-billion-parameter scale, the approach achieves up to 40% throughput gains over BF16 while avoiding the divergence that typically plagues FP8 training at scale, with equivalent loss progression and downstream performance on benchmarks such as HellaSwag, PIQA, and ARC.
- Long-Term Training: A 1.5-billion-parameter model was trained on 420 billion tokens, far beyond the Chinchilla-optimal data budget (see the back-of-the-envelope check after this list), demonstrating robustness in long data regimes.
- Divergence Analysis: The paper observes that larger models tend to diverge later in training under FP8DPA (FP8 dot-product attention), an insight that opens pathways for future research.
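For a sense of scale: a quick back-of-the-envelope check using the common Chinchilla heuristic of roughly 20 training tokens per parameter (an approximation, not a figure from the paper) shows how far beyond the compute-optimal budget the 420B-token run goes.

```python
# Back-of-the-envelope check; the 20 tokens-per-parameter rule is the usual
# Chinchilla approximation, not a number taken from the paper.
params = 1.5e9
chinchilla_tokens = 20 * params          # ~30B tokens would be compute-optimal
actual_tokens = 420e9
print(f"Chinchilla-optimal budget: ~{chinchilla_tokens / 1e9:.0f}B tokens")
print(f"Actual training run: ~{actual_tokens / chinchilla_tokens:.0f}x that budget")
```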
Implications and Future Directions
This paper takes concrete steps towards efficient fully-FP8 GEMM training for LLMs, potentially lowering the computational barriers associated with large-model training and setting a precedent for purely FP8 pre-training frameworks that rarely require higher-precision fallbacks.
The implications are promising for both practice and theory: lower compute costs make robust, scalable LLM training more accessible across domains, and the design offers a blueprint for other architectures built around low-precision computation. Future work could extend FP8 to optimizer states for additional memory savings, or apply the activation-kurtosis insights to novel transformer variants, building on the base established by FOG.
In summary, the paper contributes a meaningful advancement in AI by optimizing training methodologies at scale, reducing computational overhead while ensuring stable and efficient performance of LLMs in FP8 precision formats.