FP8 versus INT8 for efficient deep learning inference (2303.17951v2)

Published 31 Mar 2023 in cs.LG

Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed-precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8. Sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance for both the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware-training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a read of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8 and conclude with a brief discussion on the most efficient way for on-device deployment and an extensive suite of INT8 results for many models.

Evaluating FP8 versus INT8 for Efficient Deep Learning Inference

The paper "FP8 versus INT8 for efficient deep learning inference" presents a detailed comparative analysis of the FP8 and INT8 numerical formats concerning their efficacy in deep learning inference, particularly on edge devices. Authored by researchers at Qualcomm AI Research, it explores both practical hardware and theoretical performance aspects associated with these formats. The paper emphasizes the hardware inefficiency of the FP8 format compared to INT8, the implications for neural network accuracy, and the strategic superiority of INT8 in practical applications.

Introduction

The paper begins by outlining the motivations for adopting FP8 formats, which have gained traction through Nvidia's hardware support and ongoing IEEE standardization efforts for deep learning training. Despite FP8's appeal in specific training scenarios, especially for representing gradients, the authors critically investigate whether the format is also suitable for inference.

Hardware Efficiency Analysis

One of the paper's major contributions is its in-depth assessment of the hardware cost of FP8 versus INT8. In dedicated accelerator hardware, an FP8 multiply-accumulate unit is estimated to be 50-180% less efficient in area and energy than its INT8 counterpart, despite FP8's larger dynamic range. Specifically, the paper argues that accumulating INT8 products in fixed-point accumulators is cheaper than the floating-point accumulation that FP8 requires; the gap shows up in both silicon area and energy consumption, which are pivotal constraints for edge devices.
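
To put the quoted range in perspective, here is a back-of-envelope calculation (not taken from the paper's code) that reads "X% less efficient" as an FP8 multiply-accumulate unit costing (1 + X/100) times its INT8 counterpart in area or energy; the reciprocal then approximates the FP8 throughput achievable under a fixed silicon or power budget.

```python
# Back-of-envelope reading of the paper's 50-180% range (assumed interpretation):
# if an FP8 MAC costs (1 + overhead) times an INT8 MAC in area/energy, then the
# reciprocal approximates relative FP8 throughput under a fixed hardware budget.
for overhead in (0.5, 1.8):                  # 50% and 180% extra cost
    fp8_cost = 1.0 + overhead                # INT8 MAC normalized to 1.0
    print(f"FP8 MAC at {fp8_cost:.1f}x INT8 cost -> "
          f"~{1.0 / fp8_cost:.2f}x throughput at equal area/energy")
```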

Accuracy Evaluation for Neural Networks

The paper provides substantial empirical evidence on the effects of adopting FP8 in post-training quantization (PTQ) and quantization-aware training (QAT) settings. Through a theoretical analysis of how each format handles outliers, the authors show that FP8's advantage over INT8 rests primarily on its ability to represent distributions with significant outliers. For distributions close to Gaussian without sizable outliers, INT8 offers equal or better representation and accuracy.
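
This outlier argument can be reproduced with a small numerical experiment. The NumPy sketch below (not the paper's code) compares symmetric per-tensor INT8 quantization against a simplified FP8-E4M3-like value grid (IEEE-style bias with subnormals, NaN/Inf specials ignored) on a Gaussian tensor and on the same tensor with a few injected outliers; the floating-point grid only pulls ahead once the outliers stretch the integer grid.

```python
import numpy as np

def fp_grid(exp_bits=4, man_bits=3):
    """Enumerate the values of a simplified FP8-like format (sign + exponent +
    mantissa, IEEE-style bias, subnormals, no NaN/Inf specials)."""
    bias = 2 ** (exp_bits - 1) - 1
    vals = [0.0]
    for m in range(1, 2 ** man_bits):                      # subnormals
        vals.append(m / 2 ** man_bits * 2.0 ** (1 - bias))
    for e in range(1, 2 ** exp_bits):                      # normals
        for m in range(2 ** man_bits):
            vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    vals = np.array(vals)
    return np.concatenate([-vals[::-1], vals])

def quantize_to_grid(x, grid, scale):
    """Snap x/scale to the nearest grid value, then rescale."""
    idx = np.abs(x[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def quantize_int8(x, scale):
    """Symmetric uniform INT8 quantization."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
gaussian = rng.normal(size=10_000)
with_outliers = gaussian.copy()
with_outliers[:20] *= 40.0                                 # inject a few large outliers

grid = fp_grid()
for name, x in [("gaussian", gaussian), ("with outliers", with_outliers)]:
    s_int = np.abs(x).max() / 127.0                        # per-tensor max scaling
    s_fp = np.abs(x).max() / grid.max()
    mse_int = np.mean((x - quantize_int8(x, s_int)) ** 2)
    mse_fp = np.mean((x - quantize_to_grid(x, grid, s_fp)) ** 2)
    print(f"{name:14s} INT8 MSE {mse_int:.2e}   FP8-like MSE {mse_fp:.2e}")
```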

PTQ and QAT Comparative Results

Experiments show varying results across model types. FP8-E4 holds an edge in the PTQ setting for transformer architectures, whose layer normalization tends to produce significant activation outliers, while INT8 matches or outperforms FP8 on tasks such as image classification and segmentation, where such irregular distributions are rare. The findings further show that once models are fine-tuned or trained with quantization in the loop, INT8 typically recovers or surpasses FP8 accuracy even when the original distributions were problematic.
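
For context on the QAT results, the sketch below shows the straight-through-estimator fake quantization that INT8 quantization-aware training is typically built on (LSQ-style methods additionally learn the scale; here it is kept fixed). This is a generic illustration, not the paper's implementation.

```python
import torch

class FakeQuantINT8(torch.autograd.Function):
    """Symmetric INT8 fake quantization with a straight-through estimator.
    The scale is treated as a fixed constant in this sketch."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x, scale)
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        x, scale = ctx.saved_tensors
        # Straight-through: pass gradients unchanged inside the clipping range,
        # zero them outside; no gradient is returned for the fixed scale.
        mask = (x / scale).abs() <= 127
        return grad_out * mask, None

w = torch.randn(64, 64, requires_grad=True)
scale = w.detach().abs().max() / 127.0
loss = FakeQuantINT8.apply(w, scale).pow(2).sum()
loss.backward()        # gradients reach w through the fake-quantization op
```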

INT8 Versus FP8: Practical Implications

The paper makes a case for the continued dominance of the INT format family, from INT16 down to INT4, in deep learning inference, emphasizing its efficiency and adaptability across diverse workloads. Mature tooling such as Qualcomm's AIMET, which is widely used for INT quantization and network optimization, further bolsters this advantage.
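
As a concrete illustration of what such toolchains automate (AIMET's actual API is not reproduced here), the following generic NumPy sketch performs symmetric per-channel INT8 weight quantization, a standard starting point for INT8 deployment.

```python
import numpy as np

def quantize_weights_per_channel_int8(w):
    """Symmetric per-channel INT8 weight quantization, with axis 0 taken as the
    output-channel axis. Generic illustration, not tied to any toolkit's API."""
    scale = np.abs(w).max(axis=tuple(range(1, w.ndim)), keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero channels
    w_int = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int, scale

w = np.random.randn(32, 64).astype(np.float32)        # e.g. a small linear layer
w_int, scale = quantize_weights_per_channel_int8(w)
w_deq = w_int.astype(np.float32) * scale              # dequantize to check error
print("max abs quantization error:", np.abs(w - w_deq).max())
```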

Conclusion and Future Directions

The paper underscores FP8's limits in hardware efficiency and quantization accuracy across real-world scenarios, particularly the higher cost of floating-point arithmetic relative to integer arithmetic. While FP8 can be beneficial for training certain architectures, INT8 remains the better choice for practical inference deployment on edge devices.

Moving forward, the paper lays groundwork for more nuanced quantization techniques that handle outliers and ease the transition from training to deployment without sacrificing the efficiency and accuracy required for edge applications.

Authors (11)
  1. Mart van Baalen (18 papers)
  2. Andrey Kuzmin (8 papers)
  3. Suparna S Nair (1 paper)
  4. Yuwei Ren (8 papers)
  5. Eric Mahurin (2 papers)
  6. Chirag Patel (10 papers)
  7. Sundar Subramanian (6 papers)
  8. Sanghyuk Lee (80 papers)
  9. Markus Nagel (33 papers)
  10. Joseph Soriaga (5 papers)
  11. Tijmen Blankevoort (37 papers)
Citations (35)