Julia as a unifying end-to-end workflow language on the Frontier exascale system (2309.10292v3)
Abstract: We evaluate Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy's first exascale supercomputer. We evaluate the performance, scaling, and trade-offs of (i) the computational kernel on AMD's MI250x GPUs, (ii) weak scaling up to 4,096 MPI processes/GPUs or 512 nodes, (iii) parallel I/O writes using the ADIOS2 library bindings, and (iv) Jupyter Notebooks for interactive analysis. Results suggest that although Julia generates reasonable LLVM IR, a nearly 50% performance gap remains versus native AMD HIP stencil codes when running on the GPUs. As expected, we observed near-zero overhead when using the MPI and parallel I/O bindings to system-wide installed implementations. Consequently, Julia emerges as a compelling high-performance, high-productivity workflow composition language, as measured on the fastest supercomputer in the world.
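For readers unfamiliar with the model, the Gray-Scott system evolves two species u and v via ∂u/∂t = Du ∇²u − u v² + F (1 − u) and ∂v/∂t = Dv ∇²v + u v² − (F + k) v. The sketch below is a minimal serial Julia version of one explicit time step using a 7-point (3D) Laplacian; the function names, parameter values (Du, Dv, F, kr, dt), and grid setup are illustrative assumptions for this note and are not the paper's GPU kernel, which targets AMD MI250x GPUs through AMDGPU.jl.

```julia
# Minimal serial sketch (not the paper's kernel): one explicit Gray-Scott step
# with a 7-point Laplacian on a 3D grid. Parameter values are illustrative.

function laplacian7(A, i, j, k)
    # 7-point stencil: the six face neighbors minus six times the center value
    return (A[i-1, j, k] + A[i+1, j, k] +
            A[i, j-1, k] + A[i, j+1, k] +
            A[i, j, k-1] + A[i, j, k+1] - 6 * A[i, j, k])
end

function gray_scott_step!(Unew, Vnew, U, V; Du=0.2, Dv=0.1, F=0.04, kr=0.06, dt=1.0)
    nx, ny, nz = size(U)
    @inbounds for kz in 2:nz-1, j in 2:ny-1, i in 2:nx-1   # interior points only
        u, v = U[i, j, kz], V[i, j, kz]
        uvv = u * v * v
        Unew[i, j, kz] = u + dt * (Du * laplacian7(U, i, j, kz) - uvv + F * (1 - u))
        Vnew[i, j, kz] = v + dt * (Dv * laplacian7(V, i, j, kz) + uvv - (F + kr) * v)
    end
    return Unew, Vnew
end

# Example: 64^3 grid, uniform U with a small perturbed seed of V in the center
n = 64
U = ones(n, n, n); V = zeros(n, n, n)
V[n÷2-2:n÷2+2, n÷2-2:n÷2+2, n÷2-2:n÷2+2] .= 0.25
Unew, Vnew = similar(U), similar(V)
gray_scott_step!(Unew, Vnew, U, V)
```

A GPU variant would typically express the same update as an AMDGPU.jl kernel launched with @roc, exchange halo faces between ranks with MPI.jl, and write fields with the ADIOS2.jl bindings, which is the workflow composition the abstract evaluates.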