
FP8 Formats for Deep Learning (2209.05433v2)

Published 12 Sep 2022 in cs.LG

Abstract: FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, LLMs. We also examine FP8 post-training quantization of LLMs trained using 16-bit formats that resisted fixed-point int8 quantization.

FP8 Formats for Deep Learning: An Analysis

Introduction to FP8

In the field of deep learning, the quest for efficiency and speed in training and inference processes is unending. The transition from 32-bit floating point (FP32) to 16-bit formats (FP16 and bfloat16) has been a significant step forward, enabling faster computations and lower memory requirements. Building upon this foundation, an 8-bit floating point format, FP8, emerges as the next frontier in precision reduction, offering potential for further accelerating deep learning tasks. This paper presents a comprehensive investigation into two FP8 encodings, E4M3 (4-bit exponent, 3-bit mantissa) and E5M2 (5-bit exponent, 2-bit mantissa), evaluating their effectiveness across a spectrum of deep learning applications, including large-scale LLMs and various image and language tasks.

FP8: The Proposed Formats

FP8 aims to strike a balance between computational efficiency and the precision necessary for deep learning tasks. The E4M3 format, recommended primarily for weight and activation tensors, extends dynamic range by not representing infinities and reserving only a single mantissa bit pattern for NaNs, which frees additional bit patterns for finite values. E5M2, recommended for gradient tensors, adheres closely to IEEE 754 conventions, making conversion between FP16 and FP8 straightforward.
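
To make the two encodings concrete, here is a minimal Python sketch (not from the paper; the `decode_fp8` helper and its parameters are illustrative) that decodes an 8-bit pattern under either encoding, reflecting E4M3's single NaN pattern and lack of infinities versus E5M2's IEEE-style special values:

```python
def decode_fp8(bits: int, exp_bits: int, man_bits: int, bias: int,
               e4m3_specials: bool) -> float:
    """Decode one 8-bit pattern as a float (illustrative helper)."""
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp_field = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man_field = bits & ((1 << man_bits) - 1)
    all_ones = (1 << exp_bits) - 1

    if e4m3_specials:
        # E4M3: no infinities; only S.1111.111 encodes NaN, so the rest of
        # the top exponent row still holds ordinary finite values.
        if exp_field == all_ones and man_field == (1 << man_bits) - 1:
            return float("nan")
    elif exp_field == all_ones:
        # E5M2 follows IEEE 754: an all-ones exponent means Inf or NaN.
        return sign * float("inf") if man_field == 0 else float("nan")

    if exp_field == 0:  # subnormal numbers
        return sign * (man_field / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)


# Largest finite magnitudes under each encoding:
print(decode_fp8(0b0_1111_110, exp_bits=4, man_bits=3, bias=7,  e4m3_specials=True))   # 448.0
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2, bias=15, e4m3_specials=False))  # 57344.0
```

The finite values E4M3 reclaims by dropping infinities are what push its maximum magnitude to 448 despite the narrower exponent, while E5M2's IEEE-style layout keeps conversion to and from FP16 simple.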

Empirical Validation

The paper's empirical evaluation demonstrates that models trained with FP8 can match the accuracy of those trained in higher-precision formats (FP16 or bfloat16) across a range of tasks, without altering model architectures or training hyperparameters. Key findings include:

  • Image classification tasks on the ILSVRC12 dataset, including ResNet and VGG architectures, achieved comparable top-1 accuracy within the run-to-run variation of higher-precision formats.
  • Language models, spanning Transformer-based and recurrent architectures, exhibited minimal differences in evaluation scores and perplexity when trained with FP8 versus higher-precision baselines.
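
As a rough illustration of how such experiments are commonly emulated in software, the sketch below rounds a tensor to E4M3 precision and back using NumPy. It is a simplification: the max-based per-tensor scaling heuristic and the neglect of subnormals are assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def fake_quant_e4m3(x: np.ndarray) -> np.ndarray:
    """Simulate an E4M3 cast: scale, round to 3 mantissa bits, clip, unscale.

    Simplifications: subnormals are treated like normals, and the scale is a
    simple max-based heuristic rather than the paper's chosen scaling.
    """
    scale = E4M3_MAX / max(float(np.abs(x).max()), 1e-12)
    xs = np.clip(x * scale, -E4M3_MAX, E4M3_MAX)

    # xs = mant * 2**exp with mant in [0.5, 1); rounding mant to multiples of
    # 1/16 keeps 3 stored mantissa bits (plus the implicit leading bit).
    mant, exp = np.frexp(xs)
    mant = np.round(mant * 16.0) / 16.0  # round-to-nearest-even ties
    return np.ldexp(mant, exp) / scale

# Example: relative error FP8 rounding introduces on random weights.
w = np.random.randn(1024).astype(np.float32)
err = np.abs(fake_quant_e4m3(w) - w) / (np.abs(w) + 1e-12)
print(err.mean())  # typically a few percent, consistent with 3 mantissa bits
```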

Theoretical Implications and Practical Considerations

FP8's introduction and validation bear substantial implications. Theoretically, FP8 challenges the prevailing assumptions about the necessity of higher precision for deep learning training and inference. Practically, it heralds a shift towards more resource-efficient computing, potentially lowering the barriers to training larger models and democratizing access to state-of-the-art AI technologies.

Future Directions

The results open avenues for further research into optimization techniques tailored for FP8 and into its applicability across a wider range of models and tasks. Moreover, native hardware support for FP8, accommodating the two encodings' differing exponent and mantissa widths, could catalyze its adoption and make efficient AI more broadly accessible.

Concluding Thoughts

The exploration of FP8 formats for deep learning makes a compelling case for precision reduction as a pathway to accelerating AI innovation. By investigating FP8's efficacy across a broad set of deep learning tasks while balancing IEEE-754 compatibility (in E5M2) against extended dynamic range (in E4M3), this paper lays a foundation for the next evolution in AI computation. The convergence of theoretical innovation and empirical validation in the presented work underscores the potential of FP8 to chart a new course in the efficiency and accessibility of AI technologies.

Authors (15)
  1. Paulius Micikevicius (9 papers)
  2. Dusan Stosic (12 papers)
  3. Neil Burgess (8 papers)
  4. Marius Cornea (2 papers)
  5. Pradeep Dubey (31 papers)
  6. Richard Grisenthwaite (2 papers)
  7. Sangwon Ha (5 papers)
  8. Alexander Heinecke (21 papers)
  9. Patrick Judd (9 papers)
  10. John Kamalu (8 papers)
  11. Naveen Mellempudi (11 papers)
  12. Stuart Oberman (3 papers)
  13. Mohammad Shoeybi (60 papers)
  14. Michael Siu (3 papers)
  15. Hao Wu (623 papers)
Citations (102)