FP8-BERT: Post-Training Quantization for Transformer (2312.05725v2)
Abstract: Transformer-based models such as BERT have been widely applied to a broad range of natural language processing tasks. However, an inevitable side effect is that they incur massive memory and inference costs when deployed in production. Quantization is one of the most popular ways to alleviate this cost. However, previous 8-bit quantization strategies based on the INT8 data format either suffer from accuracy degradation in a Post-Training Quantization (PTQ) setting or require an expensive Quantization-Aware Training (QAT) process. Recently, a new numeric format, FP8 (i.e., 8-bit floating point), has been proposed and is supported by commercial AI computing platforms such as the H100. In this paper, we empirically validate the effectiveness of FP8 as a way to perform Post-Training Quantization without significant loss of accuracy, using only a simple calibration and format-conversion process. We adopt the FP8 standard proposed by NVIDIA Corp. (2022) in extensive experiments with BERT variants on the GLUE and SQuAD v1.1 datasets, and show that PTQ with FP8 significantly improves accuracy over PTQ with INT8, approaching that of the full-precision model.
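The abstract describes FP8 post-training quantization as a simple calibration and format-conversion step. The sketch below is a minimal illustration (not the authors' code) of what such a flow can look like: per-tensor amax calibration followed by simulated quantize-dequantize in the E4M3 format, assuming PyTorch >= 2.1, which provides the torch.float8_e4m3fn dtype. The helper names and the 448.0 E4M3 maximum constant are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of per-tensor FP8-E4M3 post-training quantization (fake quantization),
# assuming PyTorch >= 2.1 for the torch.float8_e4m3fn dtype. Helper names are hypothetical.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


def calibrate_scale(calibration_tensors):
    """Per-tensor scale derived from the max absolute value seen during calibration."""
    amax = max(t.abs().max().item() for t in calibration_tensors)
    return amax / FP8_E4M3_MAX


def fake_quant_fp8(x, scale):
    """Quantize-dequantize: scale into the FP8 range, round to E4M3, then scale back."""
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8.to(x.dtype) * scale


# Example: PTQ of one weight matrix using itself as the calibration data.
weight = torch.randn(768, 768)
scale = calibrate_scale([weight])
weight_q = fake_quant_fp8(weight, scale)
print("max abs quantization error:", (weight - weight_q).abs().max().item())
```

In an actual PTQ pipeline, the calibration pass would run a small set of representative inputs through the full-precision model to collect activation amax values, after which weights and activations are converted to FP8 for inference.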
- First-Generation Inference Accelerator Deployment at Facebook. arXiv:2107.04140.
- Arm Ltd. 2020. Cortex-M.
- Scalable Methods for 8-bit Training of Neural Networks. arXiv:1805.11046.
- Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model. arXiv:1906.00532.
- Language Models are Few-Shot Learners. arXiv:2005.14165.
- Bfloat16 Processing for Neural Networks. In 2019 IEEE 26th Symposium on Computer Arithmetic (ARITH), 88–91.
- Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks. In International Conference on Learning Representations.
- Training deep neural networks with low precision multiplications. arXiv:1412.7024.
- Graphcore Inc. 2022. C600 IPU-PROCESSOR PCIe CARD.
- Deep Learning with Limited Numerical Precision. In Bach, F.; and Blei, D., eds., Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, 1737–1746. Lille, France: PMLR.
- Habana Labs Ltd. 2022. Habana Gaudi2 White Paper.
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. International Conference on Learning Representations (ICLR).
- Horowitz, M. 2014. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 10–14.
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- I-BERT: Integer-only BERT Quantization. In International Conference on Machine Learning (ICML).
- FP8 Quantization: The Power of the Exponent. arXiv:2208.09225.
- Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Inference. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 6–9.
- Mixed Precision Training. arXiv:1710.03740.
- FP8 Formats for Deep Learning. arXiv:2209.05433.
- Migacz, S. 2017. 8-bit Inference with TensorRT. In GPU Technology Conference, volume 2, 5.
- Convolutional Neural Networks using Logarithmic Data Representation. arXiv:1603.01025.
- 8-bit Numerical Formats for Deep Neural Networks. arXiv:2206.02915.
- NVIDIA Corp. 2022. NVIDIA H100 Tensor Core GPU Architecture.
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Training Deep Neural Networks with 8-bit Floating Point Numbers. In Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
- Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation. arXiv:2004.09602.
- Q8BERT: Quantized 8Bit BERT. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 36–39.
Authors: Jianwei Li, Tianchi Zhang, Ian En-Hsu Yen, Dongkuan Xu