Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Published 7 Mar 2022 in cs.DC | (2203.03341v3)

Abstract: The Tensor Core is a mixed-precision matrix-matrix multiplication unit on NVIDIA GPUs with a theoretical peak performance of more than 300 TFlop/s on the Ampere architecture. Tensor Cores were developed in response to the high demand for dense matrix multiplication from machine learning. However, many applications in scientific computing, such as preconditioners for iterative solvers and low-precision Fourier transforms, can also exploit Tensor Cores. To compute a matrix multiplication on Tensor Cores, we need to convert the input matrices to half precision, which results in a loss of accuracy. To avoid this, we can keep the mantissa bits lost in the conversion in additional half-precision variables and use them to correct the accuracy of the matrix-matrix multiplication. Even with this correction, the use of Tensor Cores yields higher throughput than FP32 SIMT Cores. Nevertheless, the correcting capability of this method alone is limited, and the resulting accuracy cannot match that of a matrix multiplication on FP32 SIMT Cores. We address this problem and develop a high-accuracy, high-performance, and low-power matrix-matrix multiplication implementation using Tensor Cores that exactly matches the accuracy of FP32 SIMT Cores while achieving superior throughput. The implementation is based on NVIDIA's CUTLASS. We find that the key to achieving this accuracy is how to handle the rounding inside Tensor Cores and the underflow probability during the correction computation. Our implementation achieves 51 TFlop/s for a limited exponent range using FP16 Tensor Cores and 33 TFlop/s for the full exponent range of FP32 using TF32 Tensor Cores on NVIDIA A100 GPUs, outperforming the theoretical FP32 SIMT Core peak performance of 19.5 TFlop/s.

References (30)
  1. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM Journal on Scientific Computing, 42(3):C124–C141, January 2020. Publisher: Society for Industrial and Applied Mathematics.
  2. Quantum Accelerators for High-Performance Computing Systems. 2017 IEEE International Conference on Rebooting Computing (ICRC), pages 1–7, November 2017. arXiv: 1712.01423.
  3. Analyzing GPU Tensor Core Potential for Fast Reductions. In 2018 37th International Conference of the Chilean Computer Science Society (SCCC), pages 1–6, November 2018. ISSN: 1522-4902.
  4. Accelerating the Solution of Linear Systems by Iterative Refinement in Three Precisions. SIAM Journal on Scientific Computing, 40(2):A817–A847, January 2018. Publisher: Society for Industrial and Applied Mathematics.
  5. Accelerating reduction and scan using tensor core units. In Proceedings of the ACM International Conference on Supercomputing, ICS ’19, pages 46–57, New York, NY, USA, June 2019. Association for Computing Machinery.
  6. Numerical Behavior of the NVIDIA Tensor Cores, April 2020. Number: 2020.10.
  7. EGEMM-TC: accelerating scientific computing on tensor cores with extended precision. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’21, pages 278–291, New York, NY, USA, February 2021. Association for Computing Machinery.
  8. Quantum-based Molecular Dynamics Simulations using Tensor Cores. arXiv:2107.02737 [physics, physics:quant-ph], July 2021.
  9. On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–7, September 2020. ISSN: 2643-1971.
  10. Verifying Random Quantum Circuits with Arbitrary Geometry Using Tensor Network States Algorithm. Physical Review Letters, 126(7):070502, February 2021. Publisher: American Physical Society.
  11. Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613, November 2018.
  12. IBM. IBM Power Systems Announces POWER10 Processor. https://www.ibm.com/blogs/systems/ibm-power-systems-announces-power10-processor/, 2020.
  13. Intel. Ponte Vecchio. https://download.intel.com/newsroom/2021/client-computing/intel-architecture-day-2021-presentation.pdf, 2021.
  14. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826 [cs], April 2018.
  15. In-Datacenter Performance Analysis of a Tensor Processing Unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 1–12, New York, NY, USA, June 2017. Association for Computing Machinery.
  16. Closing the “quantum supremacy” gap: achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, pages 1–12, New York, NY, USA, November 2021. Association for Computing Machinery.
  17. NVIDIA Tensor Core Programmability, Performance & Precision. arXiv:1803.04014 [cs], March 2018.
  18. DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions. In High Performance Computing, Lecture Notes in Computer Science, pages 230–248, Cham, 2020. Springer International Publishing.
  19. NVIDIA. NVIDIA A100 TENSOR CORE GPU. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2020.
  20. NVIDIA. NVIDIA AMPERE GA102 GPU Architecture V1. https://images.nvidia.com/aem-dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf, 2020.
  21. NVIDIA. NVIDIA AMPERE GA102 GPU Architecture V2. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf, 2020.
  22. Randomized SVD on Tensor Cores. ISC High Performance, Research poster, June 2020.
  23. Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms, 59(1):95–118, January 2012.
  24. Preferred Networks. MN-Core - Accelerator for Deep Learning. https://projects.preferred.jp/mn-core/en/, 2018.
  25. Modeling Deep Learning Accelerator Enabled GPUs. arXiv:1811.08309 [cs], February 2019.
  26. The Effectiveness of Low-Precision Floating Arithmetic on Numerical Codes: A Case Study on Power Consumption. pages 199–206, January 2020.
  27. Optimizing the Fast Fourier Transform Using Mixed Precision on Tensor Core Hardware. In 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW), pages 3–7, December 2018.
  28. Establishing the quantum supremacy frontier with a 281 Pflop/s simulation. Quantum Science and Technology, 5(3):034003, April 2020. Publisher: IOP Publishing.
  29. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, April 2009.
  30. Accelerating sparse matrix–matrix multiplication with GPU Tensor Cores. Computers & Electrical Engineering, 88:106848, December 2020.

Summary

  • The paper presents a new error-correction method that combines external FP32 accumulation with optimized scaling to counteract round-to-zero and underflow issues.
  • It achieves FP32-equivalent accuracy while reaching up to 51 TFlop/s on NVIDIA A100 GPUs, significantly surpassing the FP32 theoretical peak performance.
  • The approach leverages NVIDIA's CUTLASS library for efficient GPU computations, promising enhanced energy efficiency and applicability in scientific and machine learning tasks.

Recovering Single Precision Accuracy from Tensor Cores while Surpassing the FP32 Theoretical Peak Performance

Introduction

The paper "Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance" (2203.03341) addresses the longstanding challenge of utilizing Tensor Cores—a mixed-precision matrix-matrix multiplication unit in NVIDIA GPUs—to recover single-precision accuracy while achieving exceptional performance that exceeds the theoretical peak of FP32 cores. Driven by the need for efficient matrix computations in machine learning and scientific computing domains, the authors propose a technique that leverages advanced error correction mechanisms and optimizes the matrix multiplication process on Tensor Cores, demonstrating significant improvements in throughput and energy efficiency.

Background

Tensor Cores are designed to accelerate mixed-precision matrix computations, primarily with FP16 inputs accumulated into FP32. However, transitioning matrices from single to half precision inherently results in accuracy loss—a critical concern for scientific computing where precision is paramount. Previous approaches, such as those by Markidis et al. and Feng et al., attempted to use auxiliary variables and improved rounding schemes to mitigate these losses. However, these methods fell short of achieving the accuracy seen with traditional FP32 SIMT cores due to inherent limitations in rounding within Tensor Cores and the probability of underflow during computations.
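As a rough, self-contained sketch of that splitting idea (our illustration, not the paper's code; the helper name split_fp32 and the variable names are ours), each FP32 input element is stored as a "hi" FP16 value plus a "lo" FP16 residual that captures the mantissa bits dropped by the first conversion, and the product is then rebuilt from the partial products of these parts:

```cuda
#include <cuda_fp16.h>

// Illustrative splitting used by correction schemes in the style of Markidis et al.:
// a ~= float(a_hi) + float(a_lo), so that A*B can be approximated by
// A_hi*B_hi + A_hi*B_lo + A_lo*B_hi, with each partial GEMM run on FP16
// Tensor Cores and accumulated in FP32.
__host__ __device__ inline void split_fp32(float a, __half &a_hi, __half &a_lo) {
    a_hi = __half(a);                // round to FP16; the low mantissa bits are lost
    a_lo = __half(a - float(a_hi));  // residual approximates the dropped bits
}
```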

Methodology

The paper introduces a novel method to circumvent these limitations by addressing two primary factors contributing to errors: the default round-to-zero (RZ) rounding mode within Tensor Cores and the high probability of underflow. The authors present theoretical and experimental analyses to determine the expected mantissa length and highlight that mantissa loss is not the core issue affecting accuracy. Instead, the focus shifts to:

  1. Avoiding RZ Rounding: The proposed method accumulates results outside Tensor Cores, utilizing FP32 SIMT cores so that rounding is performed in the more accurate round-to-nearest mode (Figure 1).

    Figure 1: The flow of computation to avoid RZ inside Tensor Cores, which affects the accuracy of Markidis' error correction method.

  2. Reducing Underflow Probabilities: By scaling the correction terms derived from the auxiliary FP16 variables, the technique minimizes underflow occurrences, thereby preserving precision across a broader exponent range (Figure 2).

    Figure 2: The theoretical and experimental probability of underflow $P_\text{u}$ in Eq. (2) and the sum of underflow and gradual underflow $P_\text{u+gu}$.
  3. Modification of Error Correction Terms: The authors propose eliminating negligible correction terms, streamlining computations without sacrificing accuracy (a scalar-level sketch combining these steps appears below).

Combining these refinements, the implementation is built upon NVIDIA's CUTLASS library—a template library for efficient GPU computations—which provides the necessary infrastructure for the method's modifications.
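To make the three refinements concrete, the scalar-level sketch below (our illustration, not the CUTLASS-based kernel; the scaling factor 2^11 is an assumed placeholder, not the value chosen in the paper) shows how the main and correction products are formed from the split operands, how the correction terms are scaled to stay out of the FP16 underflow range and un-scaled afterwards, and how the final sum is carried out in FP32 with round-to-nearest outside the Tensor Cores:

```cuda
#include <cuda_fp16.h>

// Scalar-level illustration of the corrected product. On the GPU, each partial
// product below is a separate FP16 Tensor Core GEMM; their FP32 results are
// summed on SIMT cores (round-to-nearest) rather than accumulated inside the
// Tensor Core (round-toward-zero).
__host__ __device__ inline float corrected_product(float a, float b) {
    const float scale = 2048.0f;  // 2^11, placeholder scaling factor

    const __half a_hi = __half(a);
    const __half a_lo = __half((a - float(a_hi)) * scale);  // scaled residual of a
    const __half b_hi = __half(b);
    const __half b_lo = __half((b - float(b_hi)) * scale);  // scaled residual of b

    // Main term and the two correction terms (step 2: the correction terms are
    // computed from scaled inputs, then divided back by the scale).
    const float hi_hi = float(a_hi) * float(b_hi);
    const float hi_lo = float(a_hi) * float(b_lo) / scale;
    const float lo_hi = float(a_lo) * float(b_hi) / scale;
    // Step 3: the lo*lo term is negligible at FP32 target accuracy and is omitted.

    // Step 1: the final sum happens outside the Tensor Core, in FP32 with
    // round-to-nearest.
    return hi_hi + (hi_lo + lo_hi);
}
```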

Experimental Results

The proposed method exhibits remarkable results, achieving single-precision matrix multiplication with the same accuracy as standard FP32 computations, yet surpassing their theoretical peak performance. On NVIDIA A100 GPUs, the implementation reaches 51 TFlop/s when utilizing FP16 Tensor Cores and 33 TFlop/s for TF32 Tensor Cores, considerably exceeding the FP32 SIMT core peak performance of 19.5 TFlop/s (Figures 3 and 4).

Figure 3: Accuracy comparison of matrix multiplication $\mathbf{A} \times \mathbf{B}$ of our method against existing approaches and cuBLAS SGEMM.

Figure 4: Performance comparison of our method in TF32 and FP16, cuBLAS SGEMM, and the FP32 theoretical peak.
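For reference, the reported TFlop/s figures follow from the standard GEMM operation count of 2mnk divided by the measured kernel time; the sketch below uses hypothetical matrix dimensions and timing, chosen only to show the arithmetic:

```cuda
#include <cstdio>

// Hypothetical throughput calculation: 2*m*n*k floating-point operations per GEMM
// divided by the elapsed kernel time. The dimensions and time are placeholders,
// picked so the result lands near the 51 TFlop/s reported for FP16 Tensor Cores.
int main() {
    const double m = 16384.0, n = 16384.0, k = 16384.0;  // assumed problem size
    const double seconds = 0.172;                         // assumed measured time
    const double tflops = 2.0 * m * n * k / seconds * 1e-12;
    std::printf("effective throughput: %.1f TFlop/s\n", tflops);
    return 0;
}
```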

Conclusion and Future Prospects

The research successfully addresses the accuracy-performance trade-off in utilizing Tensor Cores for scientific computations. The authors achieve FP32-equivalent accuracy while delivering throughput that surpasses the FP32 theoretical peak. This accomplishment not only paves the way for broader applications of Tensor Cores in precision-critical computations but also indicates potential advancements in energy-efficient computing solutions.

Future directions include further optimizations to reduce computational overhead introduced by accurate rounding, exploring additional hardware acceleration features, and adapting the technique for diverse application domains beyond linear algebra into areas such as quantum computing and real-time data processing in scientific simulations.

Overall, this paper serves as a significant step forward in harnessing the computational power of Tensor Cores for sophisticated numerical tasks, encouraging the development of hybrid precision strategies that balance performance and precision.
