FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313v2)
Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can employ low-precision data formats without compromising model accuracy and without requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. This framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training for LLMs, incorporating 8-bit gradients, optimizer states, and distributed learning in an incremental manner. Experimental results show that, during the training of a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only achieved a 39% reduction in real memory usage but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This largely reduces the training costs for large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at https://github.com/Azure/MS-AMP (aka.ms/MS.AMP).
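The key mechanism that makes 8-bit training workable is tensor scaling: a higher-precision tensor is multiplied by a per-tensor scaling factor before being cast to FP8, so that its values fit inside FP8's narrow dynamic range, and the scale is kept alongside the 8-bit payload for later dequantization. The snippet below is a minimal sketch of this idea for gradient tensors, not the MS-AMP implementation; the helper names are hypothetical, and it assumes a recent PyTorch release that exposes the torch.float8_e4m3fn dtype (E4M3, maximum finite magnitude 448).

```python
# Minimal sketch (not the MS-AMP implementation) of per-tensor scaled FP8
# quantization for gradients. Assumes PyTorch >= 2.1, which provides the
# torch.float8_e4m3fn dtype; helper names here are illustrative only.
import torch

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def quantize_fp8(x: torch.Tensor):
    """Scale a higher-precision tensor into the E4M3 range and cast to FP8.

    Returns the FP8 payload plus the scale needed to recover the original
    magnitudes (dequantize with fp8.float() / scale).
    """
    amax = x.abs().max().clamp(min=1e-12)      # per-tensor absolute maximum
    scale = FP8_E4M3_MAX / amax                # map amax onto the FP8 maximum
    fp8 = (x * scale).to(torch.float8_e4m3fn)  # cast rounds onto the FP8 grid
    return fp8, scale


def dequantize_fp8(fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate higher-precision tensor from the FP8 payload."""
    return fp8.float() / scale


if __name__ == "__main__":
    grad = torch.randn(4, 8) * 1e-3            # small gradients would underflow raw FP8
    fp8_grad, scale = quantize_fp8(grad)
    restored = dequantize_fp8(fp8_grad, scale)
    print("max abs error:", (grad - restored).abs().max().item())
```

Per the abstract, the framework applies this kind of scaled 8-bit representation incrementally: first to gradients, then to optimizer states, and finally to distributed parallel training.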
Authors: Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, Ruihang Li, Miaosen Zhang, Chen Li, Jia Ning, Ruizhe Wang, Zheng Zhang, Shuguang Liu, Joe Chau, Han Hu, Peng Cheng