DGEMM on Integer Matrix Multiplication Unit (2306.11975v4)
Abstract: Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point computation is commonplace: the input values, output values, and model parameters are all quantized. Consequently, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is of significant interest to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication using lower-precision computing units, and show the advantages and disadvantages of using an IMMU. Our experiments with integer Tensor Cores show that we can compute double-precision matrix multiplication faster than cuBLAS and faster than an existing Ozaki-scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate a speedup of up to 4.33x for a quantum circuit simulation while maintaining FP64 accuracy.
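To make the idea concrete, below is a minimal NumPy sketch of an Ozaki-style DGEMM on an integer matrix multiplication unit: each FP64 matrix is split into a few narrow integer "slices" (with per-row or per-column power-of-two scaling), every pair of slices is multiplied exactly with an integer matrix product, and the rescaled partial products are summed in FP64. The names `ozaki_int_gemm`, `split`, `num_slices`, and `slice_bits` are illustrative, the slicing policy is a simplification of the paper's method, and the int64 `@` product stands in for the hardware IMMU (e.g., int8 Tensor Cores with int32 accumulation).

```python
import numpy as np

def ozaki_int_gemm(A, B, num_slices=3, slice_bits=8):
    """Illustrative Ozaki-style FP64 GEMM built from exact integer GEMMs.

    A is split row-wise and B column-wise into `num_slices` integer
    slices of `slice_bits` bits each. Every slice pair is multiplied
    with an exact int64 matrix product (a stand-in for the IMMU), and
    the rescaled partial products are accumulated in FP64.
    """
    def split(M, axis):
        # Per-row (axis=1) or per-column (axis=0) power-of-two exponent
        # so that |R| < 1 after scaling; frexp gives the exponent exactly.
        amax = np.max(np.abs(M), axis=axis, keepdims=True)
        amax = np.where(amax == 0.0, 1.0, amax)
        _, e = np.frexp(amax)            # amax = mant * 2**e, mant in [0.5, 1)
        e = e.astype(np.float64)
        R = M / 2.0 ** e                 # exact power-of-two scaling
        slices = []
        for _ in range(num_slices):
            S = np.round(R * 2.0 ** slice_bits)  # top slice_bits bits as integers
            slices.append(S.astype(np.int64))
            R = R * 2.0 ** slice_bits - S        # exact residual, |R| <= 0.5
        return slices, e

    A_sl, eA = split(A, axis=1)          # eA has shape (m, 1)
    B_sl, eB = split(B, axis=0)          # eB has shape (1, n)

    C = np.zeros((A.shape[0], B.shape[1]))
    for i, Ai in enumerate(A_sl):
        for j, Bj in enumerate(B_sl):
            # Exact integer product: |entries| <= 2**slice_bits, so the
            # int64 accumulation cannot overflow for realistic sizes.
            P = Ai @ Bj
            C += 2.0 ** (-slice_bits * (i + j + 2)) * P.astype(np.float64)
    return 2.0 ** (eA + eB) * C          # undo the per-row/column scaling

# Usage: with enough slices to cover the 53-bit FP64 mantissa
# (7 slices x 8 bits = 56 bits), the result closely matches FP64 GEMM.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
err = np.abs(ozaki_int_gemm(A, B, num_slices=7) - A @ B).max()
print(f"max abs error: {err:.3e}")
```

Note the trade-off this sketch exposes: the number of integer GEMMs grows quadratically with the number of slices, which is why a practical implementation (as in the paper) drops low-order slice pairs that fall below the target accuracy.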
Authors: Hiroyuki Ootomo, Katsuhisa Ozaki, Rio Yokota