- The paper demonstrates FP8 training on trillion-token datasets, achieving model performance parity and up to 34% throughput gains.
- The paper identifies instability from SwiGLU’s outlier amplification and introduces Smooth-SwiGLU to maintain stable long-duration FP8 training.
- The paper pioneers quantizing both Adam optimizer moments to FP8, significantly reducing memory usage and enabling efficient LLM scaling.
Scaling FP8 Training to Trillion-Token LLMs
The paper "Scaling FP8 training to trillion-token LLMs" presents significant advancements in the domain of low-precision training for LLMs. The authors introduce a novel methodology for training LLMs using FP8 precision on datasets as large as 2 trillion tokens, a 20-fold increase compared to prior work, and address emergent critical instabilities in FP8 training observed only in such extensive durations.
The authors identify and analytically examine the instabilities rooted in the SwiGLU activation function, whose outputs amplify outliers over prolonged training. To mitigate this, they propose the Smooth-SwiGLU activation function, which preserves the functional behavior of SwiGLU while keeping FP8 training stable. Additionally, they demonstrate, for the first time, quantization of both Adam optimizer moments to FP8, further reducing memory usage and improving training efficiency.
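To make the mechanism concrete, the sketch below contrasts a standard SwiGLU block with a smoothed variant in the spirit of Smooth-SwiGLU: a per-channel scale is divided out of the linear branch before the elementwise product that would be cast to FP8 and re-applied afterwards, leaving the computed function unchanged. This is a minimal PyTorch sketch under those assumptions, not the authors' implementation; the scale buffer `s`, and how it would be calibrated, are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Standard SwiGLU block: w2(SiLU(w1(x)) * w3(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # linear branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class SmoothSwiGLU(SwiGLU):
    """Smoothed variant: the linear branch is divided by a per-channel scale
    before the product (the tensor an FP8 pipeline would quantize) and the
    scale is folded back afterwards, so the computed function is unchanged."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__(d_model, d_ff)
        # Hypothetical smoothing scales; in practice they would be calibrated
        # from the observed per-channel dynamic range of the linear branch.
        self.register_buffer("s", torch.ones(d_ff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w1(x))
        lin = self.w3(x) / self.s   # shrink outlier channels before quantization
        h = gate * lin              # this intermediate is what FP8 would store
        # The inverse scale can be folded into w2's weight columns offline;
        # multiplying here keeps the sketch numerically identical to SwiGLU.
        return self.w2(h * self.s)
```

With `s = 1` the two modules are identical; the benefit comes entirely from choosing `s` so that the FP8 cast of the intermediate product sees a tamer dynamic range.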
Key Contributions
- FP8 Training on Trillion-Token Datasets: The paper achieves a significant leap by training models with FP8 precision on datasets up to 2 trillion tokens. This scale exposes critical instabilities not observable in previous limited-duration studies, thereby advancing our understanding of FP8's applicability in large-scale scenarios.
- Identification of Outlier Amplification: The authors trace the training instabilities to outlier amplification by the SwiGLU activation function. Through both theoretical and empirical analyses, they link this amplification to a gradual alignment of the weights in SwiGLU's two linear branches over extended training: once the branches align, the product of the gate and linear paths grows roughly quadratically in the pre-activation magnitude rather than linearly, producing outliers that exceed FP8's dynamic range.
- Introduction of Smooth-SwiGLU: To counteract this issue, the paper introduces Smooth-SwiGLU, which rescales the intermediate activation around the FP8 quantization point so that training stays stable without changing the function the layer computes, thus maintaining model performance and enabling efficient large-scale training.
- Quantization of Adam Optimizer Moments: The paper demonstrates, for the first time, quantization of both Adam optimizer moments to FP8, reducing optimizer memory during training and enhancing the overall efficiency of LLM development (see the sketch after this list).
- Practical and Theoretical Insights: The findings offer both practical improvements and theoretical insights into training LLMs using FP8 precision. The analysis of weight alignment in SwiGLU and the subsequent introduction of Smooth-SwiGLU contribute to a deeper understanding of activation behaviors in low-precision formats.
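As a rough illustration of the optimizer-side saving, the sketch below (referenced from the Adam bullet above) keeps both Adam moments in PyTorch's FP8 dtypes and dequantizes them only for the update step. The format split (E4M3 for the first moment, E5M2 for the second), the absence of per-tensor scaling, and the helper name `fp8_adam_step` are simplifying assumptions for illustration, not the paper's exact recipe.

```python
import torch


@torch.no_grad()
def fp8_adam_step(param, grad, exp_avg_fp8, exp_avg_sq_fp8, step,
                  lr=1e-4, betas=(0.9, 0.95), eps=1e-8):
    """One Adam step with both moments stored in FP8 (hypothetical helper)."""
    beta1, beta2 = betas

    # Dequantize the stored moments to the compute precision.
    m = exp_avg_fp8.to(torch.float32)
    v = exp_avg_sq_fp8.to(torch.float32)

    # Standard Adam moment updates.
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias-corrected parameter update, cast to the parameter's dtype.
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    update = m_hat / (v_hat.sqrt() + eps)
    param.add_(update.to(param.dtype), alpha=-lr)

    # Re-quantize the moments for storage: 1 byte per element instead of
    # 4 (FP32) or 2 (BF16) per moment.
    return m.to(torch.float8_e4m3fn), v.to(torch.float8_e5m2)
```

Relative to FP32 moments this roughly quarters the memory held by the moments (and halves it relative to BF16 moments), at the cost of an extra cast per step.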
Experimental Results
The experimental results underline the efficacy of the proposed methods. A 7B-parameter Llama2 model trained in FP8 precision on 256 Intel Gaudi2 accelerators matches the BF16 baseline while delivering up to a 34% improvement in throughput, demonstrating the practical viability of the approach for state-of-the-art LLM training. Zero-shot performance across downstream tasks confirms that the FP8 models, augmented with Smooth-SwiGLU and FP8-quantized optimizer moments, maintain parity with the BF16 baseline, illustrating the robustness of the proposed framework.
Implications and Future Directions
From a practical standpoint, this paper provides a pathway to significantly reducing computational resources and memory usage in LLM training. The introduction of Smooth-SwiGLU opens new possibilities for stabilizing low-precision training, potentially applicable to other activation functions and architectures. The successful quantization of both Adam moments to FP8 sets a precedent for similar optimizations across other optimizers, fostering future research into even more efficient training methodologies.
Theoretically, the detailed analysis of weight alignment in SwiGLU enriches the broader understanding of how low-precision arithmetic interacts with activation functions during extended training. This insight is valuable for developing more refined scaling techniques and for pushing the limits of low-precision formats further.
Looking forward, future research may investigate adaptive versions of Smooth-SwiGLU tailored to various model architectures and extend moment quantization to optimizers beyond Adam. Exploring the interplay between FP8 precision and emerging hardware architectures could unlock even greater efficiencies.
In essence, this paper significantly advances the field of LLM training by tackling the challenges of FP8 precision at an unprecedented scale, paving the way for more efficient and scalable AI models.