FP8 versus INT8 for efficient deep learning inference
Abstract: Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed precision, the step to having some parts of a network run in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the world of efficient inference devices, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50% and 180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and our reading of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings, but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8, and conclude with a brief discussion on the most efficient way for on-device deployment, along with an extensive suite of INT8 results for many models.
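The INT8 versus FP8 trade-off discussed above can be illustrated with a minimal NumPy sketch (an illustrative assumption of this note, not code from the whitepaper): simulated symmetric uniform INT8 quantization compared against a simulated OCP-style FP8 E4M3 format (4 exponent bits, 3 mantissa bits, exponent bias 7, maximum finite value 448). The INT format has a constant step size across its range, while the FP format has fine steps near zero and coarse steps for large magnitudes, which is the theoretical difference the paper analyzes.

```python
import numpy as np

def quantize_int8(x, scale):
    # Symmetric uniform INT8: scale, round to nearest integer,
    # clip to the signed 8-bit range, then dequantize back.
    q = np.clip(np.round(np.asarray(x, dtype=np.float64) / scale), -128, 127)
    return q * scale

def quantize_fp8_e4m3(x):
    # Simulated E4M3 rounding (no NaN encoding modeled):
    # 3 mantissa bits give a step of 2^(e-3) inside the binade with
    # exponent e; the minimum normal exponent is -6, so values below
    # 2^-6 fall on the subnormal grid with step 2^-9.
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** -9)))  # guard log2(0)
    e = np.clip(e, -6, 8)                              # normal exponent range
    step = 2.0 ** (e - 3)                              # per-binade step size
    q = np.clip(np.round(mag / step) * step, 0.0, 448.0)  # saturate at max
    return sign * q

# Compare reconstruction error on a well-behaved (Gaussian) weight tensor.
rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
scale = np.abs(w).max() / 127          # simple max-based INT8 scale
mse_int8 = np.mean((w - quantize_int8(w, scale)) ** 2)
mse_fp8 = np.mean((w - quantize_fp8_e4m3(w)) ** 2)
print(f"INT8 MSE: {mse_int8:.2e}  FP8-E4M3 MSE: {mse_fp8:.2e}")
```

Swapping the Gaussian input for a heavy-tailed one (e.g. adding a few large outliers) shifts the comparison, which is the intuition behind the paper's finding that the better format depends on the layer's value distribution.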