FP8-LM: Training FP8 Large Language Models (2310.18313v2)

Published 27 Oct 2023 in cs.LG and cs.CL

Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of LLMs. Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).

Authors (20)
  1. Houwen Peng
  2. Kan Wu
  3. Yixuan Wei
  4. Guoshuai Zhao
  5. Yuxiang Yang
  6. Ze Liu
  7. Yifan Xiong
  8. Ziyue Yang
  9. Bolin Ni
  10. Jingcheng Hu
  11. Ruihang Li
  12. Miaosen Zhang
  13. Chen Li
  14. Jia Ning
  15. Ruizhe Wang
  16. Zheng Zhang
  17. Shuguang Liu
  18. Joe Chau
  19. Han Hu
  20. Peng Cheng

Summary

Training FP8 LLMs: Enhancing Efficiency in Memory and Speed

Introduction to FP8 Mixed-Precision Framework

Training LLMs is a formidable undertaking, chiefly because of the computational resources and memory it demands. In the quest to make training more efficient and less resource-intensive, low-precision data formats have emerged as a pivotal approach. Against this backdrop, the paper introduces an FP8 automatic mixed-precision framework designed specifically for LLMs. The framework is distinctive in applying 8-bit data formats to variables such as gradients and optimizer states during training. The research demonstrates a notable reduction in memory usage and a marked improvement in training speed without sacrificing model accuracy or requiring changes to training hyperparameters.
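
To make the per-tensor scaling behind FP8 usage concrete, here is a minimal Python sketch, not taken from the paper's MS-AMP code: a scaling factor derived from a tensor's absolute maximum maps its values into FP8's narrow representable range before casting, and dividing by the same factor recovers the original magnitudes. The constant FP8_E4M3_MAX reflects the E4M3 format's maximum of 448; the clamp-only "cast" and the function names are simplifying assumptions (a real FP8 cast also rounds the mantissa).

```python
import torch

# FP8 E4M3 can represent magnitudes up to 448 (E5M2 up to 57344).
FP8_E4M3_MAX = 448.0

def compute_scale(t: torch.Tensor, fp8_max: float = FP8_E4M3_MAX) -> torch.Tensor:
    """Per-tensor scaling factor mapping amax(t) onto the FP8 maximum."""
    amax = t.abs().max().clamp(min=1e-12)  # avoid division by zero
    return fp8_max / amax

def fake_fp8_cast(t: torch.Tensor, fp8_max: float = FP8_E4M3_MAX):
    """Simulated FP8 cast: scale into range and clamp; returns the
    'quantized' tensor plus the scale needed to dequantize it."""
    scale = compute_scale(t, fp8_max)
    return (t * scale).clamp(-fp8_max, fp8_max), scale

x = torch.randn(4, 4) * 1e-3   # small gradient-like values that would underflow in raw FP8
x_fp8, scale = fake_fp8_cast(x)
x_back = x_fp8 / scale         # dequantize back to high precision
print(torch.allclose(x, x_back, atol=1e-6))
```

The same scale-then-cast pattern is what lets the framework push gradients, optimizer states, and collective communication down to 8 bits while keeping values representable.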

Key Contributions and Findings

  1. FP8 Mixed-Precision Training Framework: The proposed framework gradually integrates 8-bit representations into the components of LLM training, from gradients and optimizer states to collective communication and distributed parallel training. This stepwise incorporation reduces memory usage by 39% when training a GPT-175B model on an H100 GPU platform and increases training speed by 75% over the BF16 mixed-precision baseline (Megatron-LM), also surpassing Nvidia Transformer Engine by 37%.
  2. Technological Innovations: Two pivotal techniques, precision decoupling and automatic scaling, are proposed to mitigate underflow, overflow, and quantization errors. Precision decoupling assigns reduced precision only to components that are insensitive to precision loss, while automatic scaling dynamically adjusts per-tensor scaling factors to keep gradient values within FP8's representational range (a minimal sketch of this scaling heuristic appears after this list). Together these techniques address numerical instabilities and preserve the accuracy and stability of LLM training with FP8.
  3. Extended Applicability and Performance: The FP8 framework's utility is not restricted to pre-training; it extends to fine-tuning tasks such as LLM instruction tuning and reinforcement learning with human feedback. Experimental results demonstrate the framework's general applicability across varied LLM tasks and its potential for significant cost savings without compromising model performance.
  4. Open-Source Contribution: The authors have made their FP8 training framework publicly available, fostering further research into efficient LLM training. This open-source contribution is expected to pave the way for wider adoption and further innovations in low-precision training methodologies.
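
As referenced in item 2 above, the following is a hedged Python sketch of the automatic-scaling idea: watch how often scaled gradient values saturate the FP8 range, back the scaling factor off when saturation becomes frequent, and grow it otherwise so that small values do not underflow. The class name, thresholds, and growth/backoff factors are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum magnitude representable in FP8 E4M3

class AutoScaler:
    """Illustrative dynamic scaling for FP8 gradients (hypothetical helper)."""

    def __init__(self, init_scale: float = 1.0, saturation_threshold: float = 1e-3,
                 backoff: float = 0.5, growth: float = 2.0):
        self.scale = init_scale
        self.saturation_threshold = saturation_threshold
        self.backoff = backoff
        self.growth = growth

    def update(self, grad: torch.Tensor) -> float:
        scaled = grad * self.scale
        # Fraction of values that would saturate (overflow) after scaling.
        saturation = (scaled.abs() >= FP8_E4M3_MAX).float().mean().item()
        if saturation > self.saturation_threshold:
            self.scale *= self.backoff   # too many clipped values: reduce the scale
        else:
            self.scale *= self.growth    # headroom remains: raise the scale to fight underflow
        return self.scale

scaler = AutoScaler()
for step in range(5):
    grad = torch.randn(4096) * 1e-4      # tiny gradients typical of late-stage training
    print(step, scaler.update(grad))
```

Real implementations typically grow the scale far less aggressively (for example, only after many consecutive saturation-free steps), but this back-off-on-overflow, grow-on-headroom loop captures the essence of keeping FP8 values inside their representable range.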

Practical Implications and Future Directions

The FP8 mixed-precision training framework marks a significant stride towards making the training of large foundational models more resource-efficient. By achieving substantial reductions in memory usage and improvements in training speed, the framework offers a viable solution to the escalating costs associated with LLM training. Furthermore, the open-source release of the FP8 low-precision training framework invites community engagement, potentially leading to advancements in other areas of AI such as multi-modal models and deployment on edge devices.

From a theoretical standpoint, this work underscores the viability of low-precision formats in maintaining training stability and model performance. The successful implementation of FP8 in LLM training could stimulate further research into even lower-bit training formats, potentially revolutionizing the computational efficiency of AI model training.

Conclusion

In summary, this paper introduces an FP8 mixed-precision training framework that reduces memory usage and speeds up LLM training while maintaining model accuracy. Through its technical innovations and comprehensive evaluation, the work demonstrates wide applicability and significant potential for cost reduction. By releasing the framework publicly, the authors encourage continued innovation in the field, potentially setting a new standard for efficient LLM training.
