Training and inference of large language models using 8-bit floating point (2309.17224v1)

Published 29 Sep 2023 in cs.LG, cs.AR, cs.CL, cs.ET, and cs.PF

Abstract: FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate LLMs of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.

An In-Depth Analysis of Training and Inference Using 8-bit Floating Point in LLMs

The paper explores a significant advancement in the computational efficiency of training and inference for LLMs through the adoption of 8-bit floating-point (FP8) formats. Existing trends toward reduced numerical precision aim to alleviate constraints on memory, bandwidth, and computational throughput. While the transition from FP32 to FP16 and BF16 has been extensively studied and implemented in contemporary machine learning systems, FP8 remains less explored, primarily because its constrained dynamic range complicates both training stability and inference accuracy.
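
To make that dynamic-range gap concrete, the short sketch below prints the approximate numerical limits of the formats involved. The E4M3/E5M2 figures follow the commonly used FP8 conventions (e.g. the "FP8 formats for deep learning" proposal); they are illustrative background, not numbers reported in this paper.

```python
# Approximate limits of the floating-point formats discussed:
# (max finite value, smallest positive subnormal). Illustration only.
FORMATS = {
    "FP8 E4M3": (448.0, 2.0 ** -9),
    "FP8 E5M2": (57344.0, 2.0 ** -16),
    "FP16":     (65504.0, 2.0 ** -24),
    "BF16":     (3.39e38, 9.2e-41),
}

for name, (fmax, fmin) in FORMATS.items():
    print(f"{name:>9}: max ~{fmax:.3g}, smallest subnormal ~{fmin:.3g}")
```

E4M3 spans only about five orders of magnitude, whereas BF16 retains the FP32 exponent range; that gap is what makes explicit per-tensor scaling necessary.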

The authors bridge this gap by proposing and validating a methodology for applying per-tensor scaling to train and validate LLMs such as GPT and Llama 2 using FP8. Specifically, they address the representation of weights, gradients, and activations, which can underflow or overflow given FP8's limited dynamic range. The proposed methodology dynamically updates per-tensor scales, a design choice that accommodates shifts in value distributions during FP8-based operations without compromising numerical integrity or accuracy.
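
As a concrete illustration of that idea, the following minimal sketch scales each tensor by its maximum absolute value before casting to FP8 and undoes the scale afterwards. It assumes PyTorch 2.1+ (which exposes the torch.float8_e4m3fn dtype) and simulates the quantize-compute-dequantize flow in software; it is not the authors' implementation, and real FP8 kernels would fuse the inverse scales into the matmul.

```python
import torch

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8(x: torch.Tensor):
    """Scale a tensor so its max |value| maps near the top of the FP8 range."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax                      # per-tensor scale
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # cast after scaling
    return x_fp8, scale

def fp8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    x8, sx = quantize_fp8(x)
    w8, sw = quantize_fp8(w)
    # Dequantize and multiply; FP8 hardware would instead multiply the FP8
    # operands directly and apply the inverse scales to the accumulator.
    return (x8.to(torch.float32) / sx) @ (w8.to(torch.float32) / sw).t()

x = torch.randn(4, 64)
w = torch.randn(128, 64)
print(fp8_linear(x, w).shape)  # torch.Size([4, 128])
```

The key design point is that the scale is recomputed per tensor from observed statistics rather than fixed globally, which keeps both small gradients and large activation outliers representable.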

Key Contributions

  1. Per-tensor Scaling Methodology: The paper introduces a framework for dynamically computing scaling biases during both training and inference, making FP8 robustly applicable to LLM workloads. The methodology is grounded in a maximum-absolute-value approach to scale selection, minimizing underflow and overflow risks across diverse operations (a hedged sketch of such an update appears after this list).
  2. Experimentation Across Sizes: Through comprehensive empirical analysis, the researchers demonstrate their methodology on LLMs ranging from 111 million to 70 billion parameters. The FP8 models remain competitive with their higher-precision counterparts, maintaining accuracy without degradation, and the evaluation spans models of varying scales and architectures.
  3. Inference and Training Viability: Extending the framework to inference, the authors show that FP8 remains effective in demanding settings such as GPT-style inference. These empirical evaluations indicate that full-scale adoption of FP8 can provide accuracy parity with higher-precision formats while reducing computational cost.
  4. Compatibility with Existing Architectures: The FP8 methodology is designed to work efficiently with prominent transformer architectures such as GPT and Llama 2, and the paper describes how FP8 computation can be integrated into existing hardware setups given the constraints of memory bandwidth and computational overhead.
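
To ground contribution 1, here is a hedged sketch of how a per-tensor scaling bias could be updated from the observed maximum absolute value each step. The power-of-two "bias" formulation and the update rule are illustrative assumptions, not the exact procedure from the paper.

```python
import math

# Hypothetical scale-bias update driven by the observed max-abs value.
# The power-of-two formulation is an assumption for illustration.

E4M3_MAX_EXP = 8  # E4M3 represents finite magnitudes up to 448 ~= 2**8.8

class ScaleBias:
    def __init__(self) -> None:
        self.bias = 0  # power-of-two shift applied before the FP8 cast

    def update(self, amax: float) -> None:
        # Pick the bias so that amax * 2**bias lands just below the FP8 maximum.
        if amax > 0:
            self.bias = E4M3_MAX_EXP - math.ceil(math.log2(amax))

    def scale(self) -> float:
        return 2.0 ** self.bias

# Example: an activation tensor whose largest magnitude this step is ~0.03
sb = ScaleBias()
sb.update(0.03)
print(sb.bias, sb.scale())  # 13 8192.0 -> 0.03 * 8192 ~= 246, inside E4M3 range
```

Tracking such a bias separately for weights, activations, and gradients, and re-estimating it as their distributions drift during training, is the essence of the dynamic per-tensor scaling the paper advocates.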

Implications and Future Directions

Theoretical Implications:

The adoption of FP8 significantly reshapes the landscape of efficient computation for LLMs. The dynamic scaling method outlined provides a theoretical backbone for addressing the representational limitations imposed by low-precision formats, and it marks a step toward practical implementations that reconcile reduced numerical precision with stable numerical behavior.

Practical Implications:

From a practical standpoint, adopting FP8 can reduce the energy consumption and hardware costs associated with deploying large models. FP8 integration can democratize access to powerful AI models by lowering inference cost and latency, especially in settings with limited computational resources.

Speculation on Future Developments:

The success of scaling methodologies for FP8 naturally opens avenues for similar advances in other subfields of AI, such as computer vision, graph neural networks, signal processing, and other data-intensive domains. Further, as hardware capabilities evolve, a pivotal future direction may involve specialized hardware tailored to dynamic FP8 operations and scaling strategies, bolstering the practical appeal of such methodologies.

Overall, this paper provides an in-depth view of FP8's application to LLMs, presenting a significant step toward making efficient, large-scale model training and inference a tangible reality and moving toward more sustainable and accessible machine learning. The detailed articulation of scaling methodologies and rigorous validation make this paper a cornerstone reference for practitioners and researchers in the AI community.

Authors (9)
  1. Sergio P. Perez
  2. Yan Zhang
  3. James Briggs
  4. Charlie Blake
  5. Josh Levy-Kramer
  6. Paul Balanca
  7. Carlo Luschi
  8. Stephen Barlow
  9. Andrew William Fitzgibbon