FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators (2403.00232v1)
Abstract: NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of additions, and whether subnormal numbers are handled predictably during computation. This makes it impossible to reliably port codes across these differing accelerators. This paper contributes a collection of Feature Targeted Tests for Numerical Properties (FTTN) that help determine these features across five floating-point formats and four rounding modes, plus additional tests that highlight rounding behaviors and the preservation of extra precision bits. To show the practical relevance of FTTN, we design a simple matrix-multiplication test informed by insights gathered from our feature tests. Executing this very simple test on five platforms produced different answers: V100, A100, and MI250X produced 0, MI100 produced 255.875, and Hopper H100 produced 191.875. Our matrix multiplication tests employ patterns found in iterative-refinement-based algorithms, highlighting the need to check for significant result variability when porting code across GPUs.
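To make the style of feature test concrete: one of the behaviors the abstract mentions, preservation of extra precision bits, can be probed by choosing fp16 inputs whose exact product needs more significand bits than fp16 can hold, then subtracting off the leading term so the low-order bits become visible in the fp32 accumulator. The sketch below is our own illustration of that idea (not the paper's actual test suite); the kernel name `probe_extra_bits` and the probe values are hypothetical choices, and it uses NVIDIA's public WMMA API on a single 16x16x16 tile.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A*B + C on a 16x16x16 tile: fp16 inputs,
// fp32 accumulation (the configuration studied for Tensor Cores).
__global__ void probe_extra_bits(const half *A, const half *B,
                                 const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::load_matrix_sync(c, C, 16, wmma::mem_row_major);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(D, c, 16, wmma::mem_row_major);
}

int main() {
    // Only A[0][0], B[0][0], C[0][0] are nonzero, so
    // D[0][0] = A[0][0]*B[0][0] + C[0][0] exactly isolates one FMA.
    half hA[256] = {}, hB[256] = {};
    float hC[256] = {}, hD[256] = {};
    // (1 + 2^-10)^2 = 1 + 2^-9 + 2^-20: exact in fp32, but the 2^-20
    // term is lost if the product is rounded to fp16 before accumulation.
    hA[0] = __float2half(1.0f + 1.0f / 1024.0f);  // 1 + 2^-10, exact in fp16
    hB[0] = __float2half(1.0f + 1.0f / 1024.0f);
    hC[0] = -1.0f;  // cancel the leading 1 to expose the low-order bits

    half *dA, *dB; float *dC, *dD;
    cudaMalloc(&dA, sizeof hA); cudaMalloc(&dB, sizeof hB);
    cudaMalloc(&dC, sizeof hC); cudaMalloc(&dD, sizeof hD);
    cudaMemcpy(dA, hA, sizeof hA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof hB, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC, sizeof hC, cudaMemcpyHostToDevice);
    probe_extra_bits<<<1, 32>>>(dA, dB, dC, dD);  // one full warp
    cudaMemcpy(hD, dD, sizeof hD, cudaMemcpyDeviceToHost);

    // Exact answer is 2^-9 + 2^-20; a device that rounds products to
    // fp16 first would instead return 2^-9.
    printf("device: %.10e  exact: %.10e\n",
           hD[0], (1.0 + 1.0 / 1024.0) * (1.0 + 1.0 / 1024.0) - 1.0);
    return 0;
}
```

Compiled with `nvcc -arch=sm_70` (or later), the comparison of the device result against the exact value reveals whether the product was kept at full precision through the accumulate; analogous probes on AMD Matrix Cores would use rocWMMA or inline MFMA intrinsics instead.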