
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators (2403.00232v1)

Published 1 Mar 2024 in cs.AR

Abstract: NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of additions, and whether subnormal numbers are handled predictably during computation. This makes it impossible to reliably port codes across these differing accelerators. This paper contributes a collection of Feature-Targeted Tests for Numerical Properties (FTTN) that help determine these features across five floating-point formats and four rounding modes, along with additional tests that highlight rounding behaviors and the preservation of extra precision bits. To show the practical relevance of FTTN, we design a simple matrix-multiplication test based on insights gathered from our feature tests. Executing this very simple test on five platforms produced different answers: V100, A100, and MI250X produced 0; MI100 produced 255.875; and Hopper H100 produced 191.875. Our matrix-multiplication tests employ patterns found in iterative refinement-based algorithms, highlighting the need to check for significant result variability when porting code across GPUs.
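The divergent results in the abstract (some GPUs returning 0, others a nonzero value) stem from whether an accelerator keeps extra precision bits in its accumulator. The sketch below, which is illustrative and not from the paper, emulates the effect in pure Python using `struct`'s binary16 format: a dot product whose small contribution cancels entirely when every intermediate sum is rounded to fp16, but survives when accumulation happens in wider precision. The function names and test vectors are hypothetical choices for this illustration.

```python
import struct

def to_fp16(x):
    # Round a Python float to IEEE binary16 (round-to-nearest-even)
    # and back, emulating an fp16 register.
    return struct.unpack('e', struct.pack('e', x))[0]

def dot_fp16_accumulate(a, b):
    # Hypothetical model: products and the running sum are both
    # rounded to fp16 at every step (no extra accumulator bits).
    s = 0.0
    for x, y in zip(a, b):
        s = to_fp16(s + to_fp16(x * y))
    return s

def dot_wide_accumulate(a, b):
    # Hypothetical model: fp16-representable inputs accumulated in a
    # wider register; Python's binary64 float stands in for it here.
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

a = [1.0, 1.0, 1.0]
b = [256.0, 0.0625, -256.0]  # 256 + 0.0625 is not representable in fp16

print(dot_fp16_accumulate(a, b))  # 0.0 -- the 0.0625 contribution is lost
print(dot_wide_accumulate(a, b))  # 0.0625 -- extra accumulator bits preserve it
```

This mirrors the cancellation pattern the paper's matrix-multiplication test exploits: the same inputs yield 0 on hardware that rounds intermediates to the storage format and a nonzero value on hardware that maintains extra precision bits, which is why results can differ across the five GPUs tested.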
