FP8 versus INT8 for efficient deep learning inference
Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the world of efficient inference devices, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50% and 180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and our reading of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings, but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8, and conclude with a brief discussion on the most efficient way for on-device deployment, along with an extensive suite of INT8 results for many models.
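The INT8 versus FP8 trade-off discussed above can be illustrated with a minimal NumPy sketch (an illustrative assumption of this note, not code from the whitepaper): simulated symmetric uniform INT8 quantization compared against a simulated OCP-style FP8 E4M3 format (4 exponent bits, 3 mantissa bits, exponent bias 7, maximum finite value 448). The INT format has a constant step size across its range, while the FP format has fine steps near zero and coarse steps for large magnitudes, which is the theoretical difference the paper analyzes.

```python
import numpy as np

def quantize_int8(x, scale):
    # Symmetric uniform INT8: scale, round to nearest integer,
    # clip to the signed 8-bit range, then dequantize back.
    q = np.clip(np.round(np.asarray(x, dtype=np.float64) / scale), -128, 127)
    return q * scale

def quantize_fp8_e4m3(x):
    # Simulated E4M3 rounding (no NaN encoding modeled):
    # 3 mantissa bits give a step of 2^(e-3) inside the binade with
    # exponent e; the minimum normal exponent is -6, so values below
    # 2^-6 fall on the subnormal grid with step 2^-9.
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** -9)))  # guard log2(0)
    e = np.clip(e, -6, 8)                              # normal exponent range
    step = 2.0 ** (e - 3)                              # per-binade step size
    q = np.clip(np.round(mag / step) * step, 0.0, 448.0)  # saturate at max
    return sign * q

# Compare reconstruction error on a well-behaved (Gaussian) weight tensor.
rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
scale = np.abs(w).max() / 127          # simple max-based INT8 scale
mse_int8 = np.mean((w - quantize_int8(w, scale)) ** 2)
mse_fp8 = np.mean((w - quantize_fp8_e4m3(w)) ** 2)
print(f"INT8 MSE: {mse_int8:.2e}  FP8-E4M3 MSE: {mse_fp8:.2e}")
```

Swapping the Gaussian input for a heavy-tailed one (e.g. adding a few large outliers) shifts the comparison, which is the intuition behind the paper's finding that the better format depends on the layer's value distribution.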