- The paper demonstrates FP8 training on trillion-token datasets, achieving model performance parity and up to 34% throughput gains.
- The paper identifies instability from SwiGLU’s outlier amplification and introduces Smooth-SwiGLU to maintain stable long-duration FP8 training.
- The paper pioneers quantizing both Adam optimizer moments to FP8, significantly reducing memory usage and enabling efficient LLM scaling.
Scaling FP8 Training to Trillion-Token LLMs
The paper "Scaling FP8 training to trillion-token LLMs" presents significant advancements in the domain of low-precision training for LLMs. The authors introduce a novel methodology for training LLMs using FP8 precision on datasets as large as 2 trillion tokens, a 20-fold increase compared to prior work, and address emergent critical instabilities in FP8 training observed only in such extensive durations.
The authors identify and analytically examine the instabilities rooted in the SwiGLU activation function, whose outputs amplify outliers over prolonged training. To mitigate this, they propose the Smooth-SwiGLU activation function, which preserves the functional behavior of SwiGLU while keeping FP8 training stable. Additionally, they demonstrate, for the first time, quantization of both Adam optimizer moments to FP8, further reducing memory usage and improving training efficiency.
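To make the mechanism concrete, the sketch below contrasts a standard SwiGLU block with a smoothed variant in the spirit of Smooth-SwiGLU: a per-channel scale is divided out of the linear branch before the elementwise product that would be cast to FP8 and re-applied afterwards, leaving the computed function unchanged. This is a minimal PyTorch sketch under those assumptions, not the authors' implementation; the scale buffer `s`, and how it would be calibrated, are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Standard SwiGLU block: w2(SiLU(w1(x)) * w3(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SmoothSwiGLU(SwiGLU):
    """Smoothed variant: the linear branch is divided by a per-channel scale
    before the product (the tensor an FP8 pipeline would quantize) and the
    scale is folded back afterwards, so the computed function is unchanged."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__(d_model, d_ff)
        # Hypothetical smoothing scales; in practice they would be calibrated
        # from the observed per-channel dynamic range of the linear branch.
        self.register_buffer("s", torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w1(x))
        lin = self.w3(x) / self.s   # shrink outlier channels before quantization
        h = gate * lin              # this intermediate is what FP8 would store
        # The inverse scale can be folded into w2's weight columns offline;
        # multiplying here keeps the sketch numerically identical to SwiGLU.
        return self.w2(h * self.s)
```

With `s = 1` the two modules are identical; the benefit comes entirely from choosing `s` so that the FP8 cast of the intermediate product sees a tamer dynamic range.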
Key Contributions
- FP8 Training on Trillion-Token Datasets: The paper achieves a significant leap by training models with FP8 precision on datasets up to 2 trillion tokens. This scale exposes critical instabilities not observable in previous limited-duration studies, thereby advancing our understanding of FP8's applicability in large-scale scenarios.
- Identification of Outlier Amplification: The authors trace the training instabilities to outlier amplification by the SwiGLU activation function. Through both theoretical and empirical analyses, they link this amplification to a gradual alignment of the weights in SwiGLU's two linear branches over extended training: once the branches align, the product of the gate and linear paths grows roughly quadratically in the pre-activation magnitude rather than linearly, producing outliers that exceed FP8's dynamic range.
- Introduction of Smooth-SwiGLU: To counteract this issue, the paper introduces Smooth-SwiGLU, which rescales the intermediate activation around the FP8 quantization point so that training stays stable without changing the function the layer computes, thus maintaining model performance and enabling efficient large-scale training.
- Quantization of Adam Optimizer Moments: The paper demonstrates, for the first time, quantization of both Adam optimizer moments to FP8, reducing optimizer memory during training and enhancing the overall efficiency of LLM development (see the sketch after this list).
- Practical and Theoretical Insights: The findings offer both practical improvements and theoretical insights into training LLMs using FP8 precision. The analysis of weight alignment in SwiGLU and the subsequent introduction of Smooth-SwiGLU contribute to a deeper understanding of activation behaviors in low-precision formats.
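As a rough illustration of the optimizer-side saving, the sketch below (referenced from the Adam bullet above) keeps both Adam moments in PyTorch's FP8 dtypes and dequantizes them only for the update step. The format split (E4M3 for the first moment, E5M2 for the second), the absence of per-tensor scaling, and the helper name `fp8_adam_step` are simplifying assumptions for illustration, not the paper's exact recipe.

```python
import torch


@torch.no_grad()
def fp8_adam_step(param, grad, exp_avg_fp8, exp_avg_sq_fp8, step,
                  lr=1e-4, betas=(0.9, 0.95), eps=1e-8):
    """One Adam step with both moments stored in FP8 (hypothetical helper)."""
    beta1, beta2 = betas

    # Dequantize the stored moments to the compute precision.
    m = exp_avg_fp8.to(torch.float32)
    v = exp_avg_sq_fp8.to(torch.float32)

    # Standard Adam moment updates.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias-corrected parameter update, cast to the parameter's dtype.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (v_hat.sqrt() + eps)
    param.add_(update.to(param.dtype), alpha=-lr)

    # Re-quantize the moments for storage: 1 byte per element instead of
    # 4 (FP32) or 2 (BF16) per moment.
    return m.to(torch.float8_e4m3fn), v.to(torch.float8_e5m2)
```

Relative to FP32 moments this roughly quarters the memory held by the moments (and halves it relative to BF16 moments), at the cost of an extra cast per step.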
Experimental Results
The experimental results underline the efficacy of the proposed methods. A 7B-parameter Llama2 model trained in FP8 precision on 256 Intel Gaudi2 accelerators matches the BF16 baseline while delivering up to a 34% improvement in throughput, demonstrating the practical viability of the approach for state-of-the-art LLM training. Zero-shot performance across downstream tasks confirms that the FP8 models, augmented with Smooth-SwiGLU and FP8-quantized optimizer moments, maintain parity with the BF16 baseline, illustrating the robustness of the proposed framework.
Implications and Future Directions
From a practical standpoint, this paper provides a pathway to significantly reducing computational resources and memory usage in LLM training. The introduction of Smooth-SwiGLU opens new possibilities for stabilizing low-precision training, potentially applicable to other activation functions and architectures. The successful quantization of both Adam moments to FP8 sets a precedent for similar optimizations across other optimizers, fostering future research into even more efficient training methodologies.
Theoretically, the detailed analysis of weight alignment in SwiGLU enriches the broader understanding of how low-precision arithmetic interacts with activation functions during extended training. This insight is valuable for developing more refined scaling techniques and for pushing the limits of low-precision formats further.
Looking forward, future research may investigate adaptive versions of Smooth-SwiGLU tailored to various model architectures and extend moment quantization to optimizers beyond Adam. Exploring the interplay between FP8 precision and emerging hardware architectures could unlock even greater efficiencies.
In essence, this paper significantly advances the field of LLM training by tackling the challenges of FP8 precision at an unprecedented scale, paving the way for more efficient and scalable AI models.