FP8-LM: Training FP8 Large Language Models (2310.18313v2)

Published 27 Oct 2023 in cs.LG and cs.CL

Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of LLMs. Our key insight is that most variables, such as gradients and optimizer states, in LLM training can employ low-precision data formats without compromising model accuracy and requiring no changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs. It gradually incorporates 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experiment results show that, during the training of GPT-175B model on H100 GPU platform, our FP8 mixed-precision training framework not only achieved a remarkable 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic. It can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).

Authors (20)
  1. Houwen Peng
  2. Kan Wu
  3. Yixuan Wei
  4. Guoshuai Zhao
  5. Yuxiang Yang
  6. Ze Liu
  7. Yifan Xiong
  8. Ziyue Yang
  9. Bolin Ni
  10. Jingcheng Hu
  11. Ruihang Li
  12. Miaosen Zhang
  13. Chen Li
  14. Jia Ning
  15. Ruizhe Wang
  16. Zheng Zhang
  17. Shuguang Liu
  18. Joe Chau
  19. Han Hu
  20. Peng Cheng

Summary

Training FP8 LLMs: Enhancing Efficiency in Memory and Speed

Introduction to FP8 Mixed-Precision Framework

Training LLMs is a formidable undertaking, chiefly because of the computational resources and memory it demands. In the quest to make training more efficient and less resource-intensive, low-precision data formats have emerged as a pivotal approach. Against this backdrop, the paper introduces an FP8 automatic mixed-precision framework designed specifically for LLMs. The framework is distinctive in applying 8-bit data formats to variables such as gradients and optimizer states during training. The research demonstrates a notable reduction in memory usage and a marked improvement in training speed without sacrificing model accuracy or requiring changes to training hyperparameters.
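
To make the per-tensor scaling behind FP8 usage concrete, here is a minimal Python sketch, not taken from the paper's MS-AMP code: a scaling factor derived from a tensor's absolute maximum maps its values into FP8's narrow representable range before casting, and dividing by the same factor recovers the original magnitudes. The constant FP8_E4M3_MAX reflects the E4M3 format's maximum of 448; the clamp-only "cast" and the function names are simplifying assumptions (a real FP8 cast also rounds the mantissa).

```python
import torch

# FP8 E4M3 can represent magnitudes up to 448 (E5M2 up to 57344).
FP8_E4M3_MAX = 448.0

def compute_scale(t: torch.Tensor, fp8_max: float = FP8_E4M3_MAX) -> torch.Tensor:
    """Per-tensor scaling factor mapping amax(t) onto the FP8 maximum."""
    amax = t.abs().max().clamp(min=1e-12)  # avoid division by zero
    return fp8_max / amax

def fake_fp8_cast(t: torch.Tensor, fp8_max: float = FP8_E4M3_MAX):
    """Simulated FP8 cast: scale into range and clamp; returns the
    'quantized' tensor plus the scale needed to dequantize it."""
    scale = compute_scale(t, fp8_max)
    return (t * scale).clamp(-fp8_max, fp8_max), scale

x = torch.randn(4, 4) * 1e-3   # small gradient-like values that would underflow in raw FP8
x_fp8, scale = fake_fp8_cast(x)
x_back = x_fp8 / scale         # dequantize back to high precision
print(torch.allclose(x, x_back, atol=1e-6))
```

The same scale-then-cast pattern is what lets the framework push gradients, optimizer states, and collective communication down to 8 bits while keeping values representable.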

Key Contributions and Findings

  1. FP8 Mixed-Precision Training Framework: The proposed framework gradually integrates 8-bit representations into the components of LLM training, from gradients and optimizer states to collective communication and distributed parallel training. This stepwise incorporation reduces memory usage by 39% when training a GPT-175B model on an H100 GPU platform and increases training speed by 75% over the BF16 mixed-precision baseline (Megatron-LM), also surpassing Nvidia Transformer Engine by 37%.
  2. Technological Innovations: Two pivotal techniques, precision decoupling and automatic scaling, are proposed to mitigate underflow, overflow, and quantization errors. Precision decoupling assigns reduced precision only to components that are insensitive to precision loss, while automatic scaling dynamically adjusts per-tensor scaling factors to keep gradient values within FP8's representational range (a minimal sketch of this scaling heuristic appears after this list). Together these techniques address numerical instabilities and preserve the accuracy and stability of LLM training with FP8.
  3. Extended Applicability and Performance: The FP8 framework's utility is not restricted to pre-training; it extends to fine-tuning tasks such as LLM instruction tuning and reinforcement learning with human feedback. Experimental results demonstrate the framework's general applicability across varied LLM tasks and its potential for significant cost savings without compromising model performance.
  4. Open-Source Contribution: The authors have made their FP8 training framework publicly available, fostering further research into efficient LLM training. This open-source contribution is expected to pave the way for wider adoption and further innovations in low-precision training methodologies.
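
As referenced in item 2 above, the following is a hedged Python sketch of the automatic-scaling idea: watch how often scaled gradient values saturate the FP8 range, back the scaling factor off when saturation becomes frequent, and grow it otherwise so that small values do not underflow. The class name, thresholds, and growth/backoff factors are illustrative assumptions, not the paper's exact mechanism.

```python
import torch

FP8_E4M3_MAX = 448.0  # maximum magnitude representable in FP8 E4M3

class AutoScaler:
    """Illustrative dynamic scaling for FP8 gradients (hypothetical helper)."""

    def __init__(self, init_scale: float = 1.0, saturation_threshold: float = 1e-3,
                 backoff: float = 0.5, growth: float = 2.0):
        self.scale = init_scale
        self.saturation_threshold = saturation_threshold
        self.backoff = backoff
        self.growth = growth

    def update(self, grad: torch.Tensor) -> float:
        scaled = grad * self.scale
        # Fraction of values that would saturate (overflow) after scaling.
        saturation = (scaled.abs() >= FP8_E4M3_MAX).float().mean().item()
        if saturation > self.saturation_threshold:
            self.scale *= self.backoff   # too many clipped values: reduce the scale
        else:
            self.scale *= self.growth    # headroom remains: raise the scale to fight underflow
        return self.scale

scaler = AutoScaler()
for step in range(5):
    grad = torch.randn(4096) * 1e-4      # tiny gradients typical of late-stage training
    print(step, scaler.update(grad))
```

Real implementations typically grow the scale far less aggressively (for example, only after many consecutive saturation-free steps), but this back-off-on-overflow, grow-on-headroom loop captures the essence of keeping FP8 values inside their representable range.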

Practical Implications and Future Directions

The FP8 mixed-precision training framework marks a significant stride towards making the training of large foundational models more resource-efficient. By achieving substantial reductions in memory usage and improvements in training speed, the framework offers a viable solution to the escalating costs associated with LLM training. Furthermore, the open-source release of the FP8 low-precision training framework invites community engagement, potentially leading to advancements in other areas of AI such as multi-modal models and deployment on edge devices.

From a theoretical standpoint, this work underscores the viability of low-precision formats in maintaining training stability and model performance. The successful implementation of FP8 in LLM training could stimulate further research into even lower-bit training formats, potentially revolutionizing the computational efficiency of AI model training.

Conclusion

In summary, this paper introduces an FP8 mixed-precision training framework that reduces memory usage and speeds up LLM training while maintaining model accuracy. Through its technical innovations and comprehensive evaluation, the work demonstrates wide applicability and significant potential for cost reduction. By releasing the framework publicly, the authors encourage continued innovation in the field, potentially setting a new standard for efficient LLM training.
