Training and inference of large language models using 8-bit floating point (2309.17224v1)

Published 29 Sep 2023 in cs.LG, cs.AR, cs.CL, cs.ET, and cs.PF

Abstract: FP8 formats are gaining popularity to boost the computational efficiency for training and inference of large deep learning models. Their main challenge is that a careful choice of scaling is needed to prevent degradation due to the reduced dynamic range compared to higher-precision formats. Although there exists ample literature about selecting such scalings for INT formats, this critical aspect has yet to be addressed for FP8. This paper presents a methodology to select the scalings for FP8 linear layers, based on dynamically updating per-tensor scales for the weights, gradients and activations. We apply this methodology to train and validate LLMs of the type of GPT and Llama 2 using FP8, for model sizes ranging from 111M to 70B. To facilitate the understanding of the FP8 dynamics, our results are accompanied by plots of the per-tensor scale distribution for weights, activations and gradients during both training and inference.

An In-Depth Analysis of Training and Inference Using 8-bit Floating Point in LLMs

The paper explores a significant advancement in the computational efficiency of training and inference for LLMs through the adoption of 8-bit floating-point (FP8) formats. Existing trends toward reduced numerical precision aim to alleviate constraints on memory, bandwidth, and computational throughput. While the transition from FP32 to FP16 and BF16 has been extensively studied and implemented in contemporary machine learning systems, FP8 remains less explored, primarily because its constrained dynamic range complicates both training stability and inference accuracy.
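
To make that dynamic-range gap concrete, the short sketch below prints the approximate numerical limits of the formats involved. The E4M3/E5M2 figures follow the commonly used FP8 conventions (e.g. the "FP8 formats for deep learning" proposal); they are illustrative background, not numbers reported in this paper.

```python
# Approximate limits of the floating-point formats discussed:
# (max finite value, smallest positive subnormal). Illustration only.
FORMATS = {
    "FP8 E4M3": (448.0, 2.0 ** -9),
    "FP8 E5M2": (57344.0, 2.0 ** -16),
    "FP16":     (65504.0, 2.0 ** -24),
    "BF16":     (3.39e38, 9.2e-41),
}

for name, (fmax, fmin) in FORMATS.items():
    print(f"{name:>9}: max ~{fmax:.3g}, smallest subnormal ~{fmin:.3g}")
```

E4M3 spans only about five orders of magnitude, whereas BF16 retains the FP32 exponent range; that gap is what makes explicit per-tensor scaling necessary.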

The authors bridge this gap by proposing and validating a methodology for applying per-tensor scaling to train and validate LLMs such as GPT and Llama 2 using FP8. Specifically, they address the representation of weights, gradients, and activations, which can underflow or overflow given FP8's limited dynamic range. The proposed methodology dynamically updates per-tensor scales, a design choice that accommodates shifts in value distributions during FP8-based operations without compromising numerical integrity or accuracy.
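
As a concrete illustration of that idea, the following minimal sketch scales each tensor by its maximum absolute value before casting to FP8 and undoes the scale afterwards. It assumes PyTorch 2.1+ (which exposes the torch.float8_e4m3fn dtype) and simulates the quantize-compute-dequantize flow in software; it is not the authors' implementation, and real FP8 kernels would fuse the inverse scales into the matmul.

```python
import torch

E4M3_MAX = 448.0  # largest finite E4M3 value

def quantize_fp8(x: torch.Tensor):
    """Scale a tensor so its max |value| maps near the top of the FP8 range."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax                      # per-tensor scale
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)  # cast after scaling
    return x_fp8, scale

def fp8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    x8, sx = quantize_fp8(x)
    w8, sw = quantize_fp8(w)
    # Dequantize and multiply; FP8 hardware would instead multiply the FP8
    # operands directly and apply the inverse scales to the accumulator.
    return (x8.to(torch.float32) / sx) @ (w8.to(torch.float32) / sw).t()

x = torch.randn(4, 64)
w = torch.randn(128, 64)
print(fp8_linear(x, w).shape)  # torch.Size([4, 128])
```

The key design point is that the scale is recomputed per tensor from observed statistics rather than fixed globally, which keeps both small gradients and large activation outliers representable.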

Key Contributions

  1. Per-tensor Scaling Methodology: The paper introduces a framework for dynamically computing scaling biases during both training and inference, making FP8 robustly applicable to LLM workloads. The methodology is grounded in a maximum-absolute-value approach to scale selection, minimizing underflow and overflow risks across diverse operations (a hedged sketch of such an update appears after this list).
  2. Experimentation Across Sizes: Through comprehensive empirical analysis, the researchers demonstrate their methodology on LLMs ranging from 111 million to 70 billion parameters. The FP8 models remain competitive with their higher-precision counterparts, maintaining accuracy without degradation, and the evaluation spans models of varying scales and architectures.
  3. Inference and Training Viability: Extending the framework to inference, the authors show that FP8 remains effective in demanding settings such as GPT-style inference. These empirical evaluations indicate that full-scale adoption of FP8 can provide accuracy parity with higher-precision formats while reducing computational cost.
  4. Compatibility with Existing Architectures: The FP8 methodology is designed to work efficiently with prominent transformer architectures such as GPT and Llama 2, and the paper describes how FP8 computation can be integrated into existing hardware setups given the constraints of memory bandwidth and computational overhead.
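
To ground contribution 1, here is a hedged sketch of how a per-tensor scaling bias could be updated from the observed maximum absolute value each step. The power-of-two "bias" formulation and the update rule are illustrative assumptions, not the exact procedure from the paper.

```python
import math

# Hypothetical scale-bias update driven by the observed max-abs value.
# The power-of-two formulation is an assumption for illustration.

E4M3_MAX_EXP = 8  # E4M3 represents finite magnitudes up to 448 ~= 2**8.8

class ScaleBias:
    def __init__(self) -> None:
        self.bias = 0  # power-of-two shift applied before the FP8 cast

    def update(self, amax: float) -> None:
        # Pick the bias so that amax * 2**bias lands just below the FP8 maximum.
        if amax > 0:
            self.bias = E4M3_MAX_EXP - math.ceil(math.log2(amax))

    def scale(self) -> float:
        return 2.0 ** self.bias

# Example: an activation tensor whose largest magnitude this step is ~0.03
sb = ScaleBias()
sb.update(0.03)
print(sb.bias, sb.scale())  # 13 8192.0 -> 0.03 * 8192 ~= 246, inside E4M3 range
```

Tracking such a bias separately for weights, activations, and gradients, and re-estimating it as their distributions drift during training, is the essence of the dynamic per-tensor scaling the paper advocates.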

Implications and Future Directions

Theoretical Implications:

The adoption of FP8 significantly reshapes the landscape of efficient computation for LLMs. The dynamic scaling method outlined provides a theoretical backbone for addressing the representational limitations imposed by low-precision formats, and it marks a step toward practical implementations that reconcile reduced numerical precision with stable numerical behavior.

Practical Implications:

From a practical standpoint, adopting FP8 can reduce the energy consumption and hardware costs associated with deploying large models. FP8 integration can democratize access to powerful AI models by lowering inference cost and latency, especially in settings with limited computational resources.

Speculation on Future Developments:

The success of scaling methodologies for FP8 naturally opens avenues for similar advances in other subfields of AI, such as computer vision, graph neural networks, signal processing, and other data-intensive domains. Further, as hardware capabilities evolve, a pivotal future direction may involve specialized hardware tailored to dynamic FP8 operations and scaling strategies, bolstering the practical appeal of such methodologies.

Overall, this paper provides an in-depth view of FP8's application to LLMs, presenting a significant step toward making efficient, large-scale model training and inference a tangible reality and moving toward more sustainable and accessible machine learning. The detailed articulation of scaling methodologies and rigorous validation make this paper a cornerstone reference for practitioners and researchers in the AI community.

Authors (9)
  1. Sergio P. Perez
  2. Yan Zhang
  3. James Briggs
  4. Charlie Blake
  5. Josh Levy-Kramer
  6. Paul Balanca
  7. Carlo Luschi
  8. Stephen Barlow
  9. Andrew William Fitzgibbon