Efficient Post-training Quantization with FP8 Formats (2309.14592v2)

Published 26 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.

Authors (6)
  1. Haihao Shen (11 papers)
  2. Naveen Mellempudi (11 papers)
  3. Xin He (135 papers)
  4. Qun Gao (3 papers)
  5. Chang Wang (28 papers)
  6. Mengni Wang (1 paper)
Citations (12)

Summary

Efficient Post-training Quantization with FP8 Formats

The paper "Efficient Post-training Quantization with FP8 Formats" presents a detailed paper on the advantages of 8-bit floating-point (FP8) data formats in post-training quantization for deep neural networks. Conducted by researchers from Intel and AMD, the paper spans 75 unique network architectures across various application domains, including machine translation, LLMing, text generation, image classification, and segmentation.

Key Contributions and Findings

The central contributions of this paper are a unified FP8 quantization workflow and an empirical validation of FP8's advantages over INT8. The paper investigates three FP8 representations (E5M2, E4M3, and E3M4) to analyze the trade-off between dynamic range and precision; a small emulation sketch follows the list below. Key findings include:

  • Workload Coverage and Accuracy: FP8 formats demonstrate superior workload coverage at 92.64%, compared to 65.87% with INT8. In particular, the E4M3 format achieves 96.32% coverage on NLP workloads, while E3M4 slightly outperforms E4M3 on computer vision tasks with 78.95% coverage.
  • Quantization Workflow: The paper outlines a standard and an extended quantization workflow. The standard quantization scheme applies broadly to common operators like Convolution, Linear, and Embedding, while the extended scheme addresses specific application needs such as LayerNorm, BatchNorm, and mixed FP8 formats.
  • Mixed FP8 Formats: Utilizing a combination of FP8 formats (e.g., E4M3 for activations and E3M4 for weights) shows appreciable improvement in preserving model accuracy across various NLP tasks.
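
To make the range/precision trade-off concrete, below is a minimal emulation sketch in NumPy. It is not the paper's implementation (the authors used the FP8 Emulation Toolkit); the format limits are approximate OCP-style values, and subnormal and special-value encodings are ignored.

```python
import numpy as np

# Approximate saturation values for the three formats studied in the paper
# (assumed OCP-style conventions; exact limits depend on NaN/Inf encoding).
FP8_FORMATS = {
    "E5M2": {"man_bits": 2, "max": 57344.0},  # wide range, coarse precision
    "E4M3": {"man_bits": 3, "max": 448.0},    # balanced; favored for NLP
    "E3M4": {"man_bits": 4, "max": 30.0},     # narrow range, fine precision; favored for CV
}

def fp8_round(x, fmt):
    """Emulate a round-to-nearest cast of FP32 values into an FP8 format.

    Values are saturated to the format's maximum, and each value's mantissa
    is snapped to `man_bits` fractional bits at its own binary exponent.
    Subnormals and special values are ignored for simplicity.
    """
    cfg = FP8_FORMATS[fmt]
    x = np.clip(np.asarray(x, dtype=np.float32), -cfg["max"], cfg["max"])
    out = np.zeros_like(x)
    nz = x != 0
    exp = np.floor(np.log2(np.abs(x[nz])))   # per-value binary exponent
    step = np.exp2(exp - cfg["man_bits"])    # quantization step at that exponent
    out[nz] = np.round(x[nz] / step) * step
    return out

vals = np.array([3.3, 250.0, 1000.0], dtype=np.float32)
for fmt in FP8_FORMATS:
    print(fmt, fp8_round(vals, fmt))
# E5M2 keeps 1000.0 in range (rounding it to 1024) but rounds 3.3 coarsely
# to 3.5; E3M4 resolves 3.3 more finely (3.25) but saturates 1000.0 at ~30.
```

In this framing, the mixed-format recipe simply applies E4M3-style rounding to activation tensors and E3M4-style rounding to weight tensors.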

Theoretical and Practical Implications

The theoretical implications center on how FP8 mitigates the well-known weaknesses of INT8 quantization, namely limited dynamic range and poor representation of outliers. The practical implications are far-reaching: post-training quantization to FP8 can significantly reduce computational cost while maintaining high model accuracy, making it suitable for both data-center and edge deployments.

Experimental Setup and Results

The researchers employed the FP8 Emulation Toolkit and the Intel Neural Compressor to evaluate their quantization methods, covering over 200 tasks using 75 different models and more than 20 datasets. The models span various categories, including text and natural language processing, image and computer vision, audio and speech processing, and recommendation systems.

  • Accuracy Metrics: FP8 formats, particularly E4M3 and E3M4, show minimal accuracy loss compared to FP32 baselines. For instance, the E4M3 format achieves high accuracy for both computer vision and NLP models, outperforming INT8 in several cases.
  • Generation Quality: For generative models like Stable Diffusion, FP8 formats yielded better image quality and lower Fréchet Inception Distance (FID) scores than INT8.
  • Extended Quantization Recipes: The paper explores extended recipes such as dynamic quantization and quantizing additional operators like LayerNorm and BatchNorm, further emphasizing the versatility and robustness of FP8 formats; a brief sketch of static versus dynamic scaling follows this list.
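
As a rough illustration of how static and dynamic recipes differ, the following sketch assumes simple per-tensor max scaling into E4M3; the 448.0 limit and the function names are illustrative, not the toolkit's actual API.

```python
import numpy as np

E4M3_MAX = 448.0  # approximate E4M3 saturation value (assumed)

def static_scale(calibration_batches):
    """Static recipe: derive a single scale offline from calibration data."""
    amax = max(float(np.abs(b).max()) for b in calibration_batches)
    return E4M3_MAX / amax

def dynamic_scale(x):
    """Dynamic recipe: recompute the scale from each tensor at runtime."""
    return E4M3_MAX / float(np.abs(x).max())

calib = [np.random.randn(16, 64).astype(np.float32) for _ in range(8)]
s_static = static_scale(calib)    # fixed for all future inputs
x = np.random.randn(16, 64).astype(np.float32)
s_dynamic = dynamic_scale(x)      # adapts to this particular input's range
```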

Discussion and Future Directions

The paper discusses specific challenges and solutions in quantizing neural networks with FP8. For instance, quantizing the first and last operators in convolutional networks has a significant influence on accuracy. Moreover, the authors note that simple max scaling suffices for handling outliers in FP8 formats, whereas INT8 quantization typically requires more sophisticated outlier-handling methods.
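
The max-scaling observation can be illustrated with a small, self-contained comparison (a sketch, not the paper's evaluation code): quantize a tensor containing a single large outlier with plain per-tensor max scaling, once to INT8 and once to emulated E4M3.

```python
import numpy as np

def quant_dequant_int8(x):
    """Per-tensor symmetric INT8 quantization with simple max scaling."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

def quant_dequant_e4m3(x, e4m3_max=448.0, man_bits=3):
    """Emulated E4M3 quantization with max scaling (subnormals ignored)."""
    scale = np.abs(x).max() / e4m3_max
    y = x / scale
    out = np.zeros_like(y)
    nz = y != 0
    exp = np.floor(np.log2(np.abs(y[nz])))
    step = np.exp2(exp - man_bits)
    out[nz] = np.round(y[nz] / step) * step
    return out * scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
x[0] = 8.0  # one large outlier stretches the quantization range

for name, fn in [("INT8", quant_dequant_int8), ("E4M3", quant_dequant_e4m3)]:
    print(f"{name}: mean abs error = {np.abs(fn(x) - x).mean():.6f}")
# With the 8.0 outlier, the INT8 step is ~0.063, so most values of magnitude
# ~0.02 collapse to zero; E4M3's per-value exponent preserves several
# significant bits for small values despite the same max-based scale.
```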

Looking ahead, the paper suggests that future work may explore FP8 quantization across a wider variety of LLMs and contribute the findings to open-source communities. Potential avenues include further optimizing FP8 quantization recipes and examining their impact on newer model architectures.

Conclusion

In summary, "Efficient Post-training Quantization with FP8 Formats" provides a comprehensive and rigorous analysis of FP8 formats for neural network quantization. The findings underscore the formats' advantages in terms of workload coverage, model accuracy, and operational flexibility, highlighting their potential as a viable alternative to traditional INT8 quantization methods. The paper's contributions lay a solid foundation for future research and practical implementations in the domain of efficient neural network inference.
