Training and inference of large language models using 8-bit floating point (2309.17224v1)
Abstract: FP8 formats are gaining popularity as a way to boost the computational efficiency of training and inference for large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although ample literature exists on selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate GPT- and Llama 2-type LLMs using FP8, for model sizes ranging from 111M to 70B. To facilitate understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.
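The methodology summarized above keeps per-tensor scales for weights, activations and gradients so that each tensor fits the narrow FP8 dynamic range. Below is a minimal PyTorch sketch of the general amax-based per-tensor scaling idea for an FP8 (E4M3) matmul, assuming `torch.float8_e4m3fn` is available; the helper names (`compute_scale`, `quantize_fp8_e4m3`) and the `margin` parameter are illustrative assumptions, not the paper's exact procedure.

```python
import torch

# Largest finite magnitude representable in the E4M3 format.
FP8_E4M3_MAX = 448.0

def compute_scale(tensor: torch.Tensor, margin: float = 0.0) -> torch.Tensor:
    """Choose a per-tensor scale so the scaled tensor fits the FP8 range.

    `margin` backs the scale off by a power of two for extra headroom;
    it is an illustrative knob, not a value taken from the paper.
    """
    amax = tensor.abs().max().clamp(min=1e-12)
    return (FP8_E4M3_MAX / amax) / (2.0 ** margin)

def quantize_fp8_e4m3(tensor: torch.Tensor, scale: torch.Tensor):
    """Simulated FP8 cast: apply the scale, convert to E4M3, keep the scale."""
    fp8 = (tensor * scale).to(torch.float8_e4m3fn)
    return fp8, scale

# Example: weights and activations are scaled independently, and the matmul
# output is divided by the product of the scales to restore the original range.
w = torch.randn(4096, 4096)
x = torch.randn(16, 4096)
w8, sw = quantize_fp8_e4m3(w, compute_scale(w))
x8, sx = quantize_fp8_e4m3(x, compute_scale(x))
y = (x8.to(torch.float16) @ w8.to(torch.float16).T) / (sw * sx)
```

Dividing the matmul output by the product of the two scales undoes the scaling, so downstream layers see values in the original range regardless of how the per-tensor scales were chosen.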