Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures (2404.04441v2)

Published 5 Apr 2024 in cs.DC

Abstract: Accelerated computing is widely used in high-performance computing, so it is crucial to experiment with and discover how best to utilize the latest generations of GPUs on relevant applications. In this paper, we present results and share insights about highly tuned stencil-based kernels for the NVIDIA Ampere (A100) and Hopper (GH200) architectures. The performance results yield useful insights into the behavior of this class of algorithms on these new accelerators, knowledge that can be leveraged by the many scientific applications that involve stencil computations. Furthermore, we evaluate three programming models, CUDA, OpenACC, and OpenMP target offloading, on the aforementioned accelerators. We extensively study the performance and portability of various kernels under each programming model and provide corresponding optimization recommendations. We also compare the performance of the programming models across the two architectures. Up to 58% performance improvement was achieved over the previous GPU architecture generation for a highly optimized kernel of the same class, and up to 42% across all kernel classes. In terms of programming models, and with portability in mind, the optimized OpenACC implementation outperforms the OpenMP implementation by 33%. When portability is not a factor, our best tuned CUDA implementation outperforms the optimized OpenACC one by 2.1x.
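
The abstract contrasts CUDA, OpenACC, and OpenMP target offloading for stencil kernels. As a point of reference only, the sketch below shows a naive 7-point 3D stencil written as a CUDA kernel, with the equivalent OpenACC and OpenMP target directives indicated in comments. The grid size, coefficients, launch configuration, and file name are illustrative assumptions; this is not the paper's tuned, high-order seismic kernels.

// stencil7.cu -- minimal 7-point 3D stencil sketch (illustrative only, not the paper's kernel).
// Build (assumption): nvcc -O3 stencil7.cu -o stencil7
#include <cstdio>
#include <cuda_runtime.h>

#define N 256                       // illustrative grid size (assumption)
#define IDX(i, j, k) ((i) * N * N + (j) * N + (k))

// Naive CUDA version: one thread per grid point; boundary threads return early.
__global__ void stencil7(const float *in, float *out) {
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // k is the fastest-varying index (coalesced)
    if (i < 1 || i >= N - 1 || j < 1 || j >= N - 1 || k < 1 || k >= N - 1)
        return;
    out[IDX(i, j, k)] = 0.4f * in[IDX(i, j, k)]
                      + 0.1f * (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)]
                              + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                              + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]);
}

// Directive-based equivalents of the same triple loop nest over (i, j, k):
//   OpenACC:  #pragma acc parallel loop collapse(3) copyin(in[0:N*N*N]) copyout(out[0:N*N*N])
//   OpenMP:   #pragma omp target teams distribute parallel for collapse(3) \
//                 map(to: in[0:N*N*N]) map(from: out[0:N*N*N])

int main() {
    size_t bytes = (size_t)N * N * N * sizeof(float);
    float *in, *out;
    cudaMallocManaged(&in, bytes);
    cudaMallocManaged(&out, bytes);
    for (size_t idx = 0; idx < (size_t)N * N * N; ++idx) in[idx] = 1.0f;

    dim3 block(32, 4, 4);           // a common starting block shape, not a tuned choice
    dim3 grid((N + block.x - 1) / block.x,
              (N + block.y - 1) / block.y,
              (N + block.z - 1) / block.z);
    stencil7<<<grid, block>>>(in, out);
    cudaDeviceSynchronize();

    printf("out[center] = %f\n", out[IDX(N / 2, N / 2, N / 2)]);  // 1.0 for this initialization
    cudaFree(in);
    cudaFree(out);
    return 0;
}

The paper's tuned implementations go well beyond this baseline (higher-order stencils, shared-memory and register blocking, and per-architecture launch tuning); the sketch only fixes the shape of the computation being compared across the three programming models.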
