Taking GPU Programming Models to Task for Performance Portability (2402.08950v3)
Abstract: Portability is critical to ensuring high productivity in developing and maintaining scientific software as the diversity in on-node hardware architectures increases. While several programming models provide portability across diverse GPU platforms, they make no guarantees about performance portability. In this work, we explore several programming models (CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL) to study whether their performance is consistently good across NVIDIA and AMD GPUs. We use five proxy applications from different scientific domains, create implementations where they are missing, and use them to present a comprehensive comparative evaluation of the programming models. We provide a Spack scripting-based methodology to ensure reproducibility of the experiments conducted in this work. Finally, we attempt to answer the question: to what extent does each programming model provide performance portability for heterogeneous systems in real-world usage?