Automated MPI-X code generation for scalable finite-difference solvers (2312.13094v4)
Abstract: Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate, time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques tailored for distributed-memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers from high-level symbolic mathematical input. Users can model real-world simulations at a high level of symbolic abstraction and harness HPC-ready distributed-memory parallelism without altering their source code, drastically reducing both execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, proving its effectiveness and its capability to meet the demands of large-scale scientific simulations.
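The core idea the paper automates can be illustrated with a minimal sketch: split an explicit FD stencil's domain across ranks and exchange one-deep halos before each time step, so that the decomposed run reproduces the serial result exactly. The sketch below uses pure Python and simulates two "ranks" in-process; it is an illustrative assumption, not Devito's actual generated code, which issues real MPI send/receive pairs and handles arbitrary decompositions and stencil radii.

```python
def step(u, c=0.1):
    """One explicit FD update (1D heat equation) on the interior of u,
    where u carries one extra cell on each side (boundary or halo)."""
    return [u[i] + c * (u[i - 1] - 2 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)]

def serial(n=10, steps=5):
    """Reference serial solve: fixed zero boundaries, point source at the center."""
    u = [0.0] * n
    u[n // 2] = 1.0
    for _ in range(steps):
        u = [u[0]] + step(u) + [u[-1]]
    return u

def decomposed(n=10, steps=5):
    """Same solve split across two 'ranks', each owning half the domain
    plus one halo cell toward its neighbour."""
    u = [0.0] * n
    u[n // 2] = 1.0
    half = n // 2
    left, right = u[:half + 1], u[half - 1:]
    for _ in range(steps):
        # Halo exchange: in generated MPI code these two copies become
        # matched send/recv (or non-blocking Isend/Irecv) pairs.
        left[-1] = right[1]
        right[0] = left[-2]
        # Fully independent local updates on each rank.
        left = [left[0]] + step(left) + [left[-1]]
        right = [right[0]] + step(right) + [right[-1]]
    return left[:-1] + right[1:]   # gather owned cells

assert decomposed() == serial()    # decomposition preserves the result bit-for-bit
```

Because each rank's stencil only ever reads one cell beyond its owned region, a one-cell halo suffices here; wider stencils (higher-order FD discretizations) simply need deeper halos, which is exactly the bookkeeping the compiler derives automatically from the symbolic stencil radius.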
- F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, “Firedrake: Automating the finite element method by composing abstractions,” ACM Trans. Math. Softw., vol. 43, no. 3, Dec. 2016. [Online]. Available: https://doi.org/10.1145/2998441
- S. Balay, S. Abhyankar, M. F. Adams, S. Benson, J. Brown, P. Brune, K. Buschelman, E. Constantinescu, L. Dalcin, A. Dener, V. Eijkhout, J. Faibussowitsch, W. D. Gropp, V. Hapla, T. Isaac, P. Jolivet, D. Karpeev, D. Kaushik, M. G. Knepley, F. Kong, S. Kruger, D. A. May, L. C. McInnes, R. T. Mills, L. Mitchell, T. Munson, J. E. Roman, K. Rupp, P. Sanan, J. Sarich, B. F. Smith, S. Zampini, H. Zhang, H. Zhang, and J. Zhang, “PETSc/TAO users manual,” Argonne National Laboratory, Tech. Rep. ANL-21/39 - Revision 3.20, 2023.
- R. J. Hewett and T. J. Grady II, “A linear algebraic approach to model parallelism in deep learning,” 2020. [Online]. Available: https://arxiv.org/abs/2006.03108
- T. J. Grady, R. Khan, M. Louboutin, Z. Yin, P. A. Witte, R. Chandra, R. J. Hewett, and F. J. Herrmann, “Model-parallel Fourier neural operators as learned surrogates for large-scale parametric PDEs,” Computers & Geosciences, vol. 178, p. 105402, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0098300423001061
- F. Luporini, M. Louboutin, M. Lange, N. Kukreja, P. Witte, J. Hückelheim, C. Yount, P. H. J. Kelly, F. J. Herrmann, and G. J. Gorman, “Architecture and performance of Devito, a system for automated stencil computation,” ACM Trans. Math. Softw., vol. 46, no. 1, Apr. 2020. [Online]. Available: https://doi.org/10.1145/3374916
- M. Louboutin, M. Lange, F. Luporini, N. Kukreja, P. A. Witte, F. J. Herrmann, P. Velesko, and G. J. Gorman, “Devito (v3.1.0): an embedded domain-specific language for finite differences and geophysical exploration,” Geoscientific Model Development, vol. 12, no. 3, pp. 1165–1187, 2019. [Online]. Available: https://www.geosci-model-dev.net/12/1165/2019/
- P. A. Witte, M. Louboutin, N. Kukreja, F. Luporini, M. Lange, G. J. Gorman, and F. J. Herrmann, “A large-scale framework for symbolic implementations of seismic inversion algorithms in Julia,” Geophysics, vol. 84, no. 3, pp. F57–F71, 2019. [Online]. Available: https://doi.org/10.1190/geo2018-0174.1
- C. Cueto, L. Guasch, F. Luporini, O. Bates, G. Strong, O. C. Agudo, J. Cudeiro, P. Kelly, G. Gorman, and M.-X. Tang, “Tomographic ultrasound modelling and imaging with Stride and Devito,” in Medical Imaging 2022: Ultrasonic Imaging and Tomography, N. Bottenus and N. V. Ruiter, Eds., vol. PC12038, International Society for Optics and Photonics. SPIE, 2022, p. PC1203805. [Online]. Available: https://doi.org/10.1117/12.2611072
- C. Cueto, O. Bates, G. Strong, J. Cudeiro, F. Luporini, O. C. Agudo, G. Gorman, L. Guasch, and M.-X. Tang, “Stride: A flexible software platform for high-performance ultrasound computed tomography,” Computer Methods and Programs in Biomedicine, vol. 221, p. 106855, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169260722002371
- A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz, “SymPy: symbolic computing in Python,” PeerJ Computer Science, vol. 3, p. e103, Jan. 2017. [Online]. Available: https://doi.org/10.7717/peerj-cs.103
- L. Dalcin and Y.-L. L. Fang, “mpi4py: Status update after 12 years of development,” Computing in Science & Engineering, vol. 23, no. 4, pp. 47–54, Jul. 2021. [Online]. Available: https://doi.org/10.1109/MCSE.2021.3083216
- G. Bisbas, F. Luporini, M. Louboutin, R. Nelson, G. J. Gorman, and P. H. Kelly, “Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources,” in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 497–506. [Online]. Available: https://ieeexplore.ieee.org/document/9460483
- M. Louboutin, M. Lange, F. J. Herrmann, N. Kukreja, and G. Gorman, “Performance prediction of finite-difference solvers for different computer architectures,” Computers & Geosciences, vol. 105, pp. 148–157, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0098300416304034
- S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: https://doi.org/10.1145/1498765.1498785
- Y. Zhang, H. Zhang, and G. Zhang, “A stable TTI reverse time migration and its implementation,” Geophysics, vol. 76, no. 3, pp. WA3–WA11, 2011. [Online]. Available: https://library.seg.org/doi/10.1190/1.3554411
- M. Louboutin, P. Witte, and F. J. Herrmann, “Effects of wrong adjoints for RTM in TTI media,” in SEG Technical Program Expanded Abstracts 2018. Society of Exploration Geophysicists, 2018, pp. 331–335. [Online]. Available: https://library.seg.org/doi/10.1190/segam2018-2996274.1
- E. Duveneck and P. M. Bakker, “Stable P-wave modeling for reverse-time migration in tilted TI media,” Geophysics, vol. 76, no. 2, pp. S65–S75, 2011. [Online]. Available: https://doi.org/10.1190/1.3533964
- T. Alkhalifah, “An acoustic wave equation for anisotropic media,” Geophysics, vol. 65, pp. 1239–1250, 2000. [Online]. Available: https://library.seg.org/doi/10.1190/1.1444815
- K. Bube, J. Washbourne, R. Ergas, and T. Nemeth, “Self-adjoint, energy-conserving second-order pseudoacoustic systems for VTI and TTI media for reverse time migration and full-waveform inversion,” in SEG Technical Program Expanded Abstracts 2016. Society of Exploration Geophysicists, 2016, pp. 1110–1114. [Online]. Available: https://library.seg.org/doi/10.1190/segam2016-13878451.1
- J. O. A. Robertsson, J. O. Blanch, and W. W. Symes, “Viscoelastic finite-difference modeling,” Geophysics, vol. 59, no. 9, pp. 1444–1456, 1994. [Online]. Available: https://library.seg.org/doi/abs/10.1190/1.1443701
- A. Gholamy and V. Kreinovich, “Why Ricker wavelets are successful in processing seismic data: Towards a theoretical explanation,” in 2014 IEEE Symposium on Computational Intelligence for Engineering Solutions (CIES), 2014, pp. 11–16. [Online]. Available: https://ieeexplore.ieee.org/document/7011824
- T. Zhao, S. Williams, M. Hall, and H. Johansen, “Delivering performance-portable stencil computations on CPUs and GPUs using Bricks,” in 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2018, pp. 59–70.
- H. Wang and A. Chandramowlishwaran, “Pencil: A pipelined algorithm for distributed stencils,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’20. IEEE Press, 2020. [Online]. Available: https://dl.acm.org/doi/10.5555/3433701.3433814
- T. Malas, G. Hager, H. Ltaief, H. Stengel, G. Wellein, and D. Keyes, “Multicore-optimized wavefront diamond blocking for optimizing stencil updates,” SIAM J. Sci. Comput., vol. 37, no. 4, pp. C439–C464, Jan. 2015. [Online]. Available: https://doi.org/10.1137/140991133
- N. Maruyama, T. Nomura, K. Sato, and S. Matsuoka, “Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’11. New York, NY, USA: Association for Computing Machinery, 2011. [Online]. Available: https://doi.org/10.1145/2063384.2063398
- C. Yount, J. Tobin, A. Breuer, and A. Duran, “YASK - yet another stencil kernel: A framework for HPC stencil code-generation and tuning,” in Proceedings of the Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for HPC, ser. WOLFHPC ’16. IEEE Press, 2016, pp. 30–39. [Online]. Available: https://dl.acm.org/doi/10.5555/3019129.3019133
- A. Afanasyev, M. Bianco, L. Mosimann, C. Osuna, F. Thaler, H. Vogt, O. Fuhrer, J. VandeVondele, and T. C. Schulthess, “GridTools: A framework for portable weather and climate applications,” SoftwareX, vol. 15, p. 100707, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2352711021000522
- S. Omlin, L. Räss, and I. Utkin, “Distributed parallelization of xPU stencil computations in Julia,” arXiv preprint arXiv:2211.15716, 2022. [Online]. Available: https://arxiv.org/abs/2211.15716
- J. D. Betteridge, P. E. Farrell, and D. A. Ham, “Code generation for productive, portable, and scalable finite element simulation in Firedrake,” Computing in Science & Engineering, vol. 23, no. 4, pp. 8–17, 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9447889
- G. R. Mudalige, I. Z. Reguly, A. Prabhakar, D. Amirante, L. Lapworth, and S. A. Jarvis, “Towards virtual certification of gas turbine engines with performance-portable simulations,” in 2022 IEEE International Conference on Cluster Computing (CLUSTER), 2022, pp. 206–217. [Online]. Available: https://ieeexplore.ieee.org/document/9912706
- P. Vincent, F. Witherden, B. Vermeire, J. S. Park, and A. Iyer, “Towards green aviation with Python at petascale,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’16. IEEE Press, 2016. [Online]. Available: https://dl.acm.org/doi/10.5555/3014904.3014906
- S. Adams, R. Ford, M. Hambley, J. Hobson, I. Kavčič, C. Maynard, T. Melvin, E. Müller, S. Mullerworth, A. Porter, M. Rezny, B. Shipway, and R. Wong, “LFRic: Meeting the challenges of scalability and performance portability in weather and climate models,” Journal of Parallel and Distributed Computing, vol. 132, pp. 383–396, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731518305306
- T. Denniston, S. Kamil, and S. Amarasinghe, “Distributed Halide,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’16. New York, NY, USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2851141.2851157
- J. Pekkilä, M. S. Väisälä, M. J. Käpylä, M. Rheinhardt, and O. Lappi, “Scalable communication for high-order stencil computations using CUDA-aware MPI,” Parallel Computing, vol. 111, p. 102904, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819122000102
- I. Zacharoudiou, J. McCullough, and P. Coveney, “Development and performance of a HemeLB GPU code for human-scale blood flow simulation,” Computer Physics Communications, vol. 282, p. 108548, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465522002673
- C. T. Jacobs, S. P. Jammy, and N. D. Sandham, “OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures,” Journal of Computational Science, vol. 18, pp. 12–23, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S187775031630299X
- D. J. Lusher, S. P. Jammy, and N. D. Sandham, “OpenSBLI: Automated code-generation for heterogeneous computing architectures applied to compressible fluid dynamics on structured grids,” Computer Physics Communications, vol. 267, p. 108063, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010465521001752
- I. Z. Reguly, G. R. Mudalige, and M. B. Giles, “Loop tiling in large-scale stencil codes at run-time with OPS,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 4, pp. 873–886, 2018.
- G. Mudalige, I. Reguly, S. Jammy, C. Jacobs, M. Giles, and N. Sandham, “Large-scale performance of a DSL-based multi-block structured-mesh application for direct numerical simulation,” Journal of Parallel and Distributed Computing, vol. 131, pp. 130–146, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0743731518305690
- G. Mudalige, M. Giles, I. Reguly, C. Bertolli, and P. Kelly, “OP2: An active library framework for solving unstructured mesh-based applications on multi-core and many-core architectures,” in 2012 Innovative Parallel Computing (InPar), 2012, pp. 1–12. [Online]. Available: https://ieeexplore.ieee.org/document/6339594
- S. Macià, P. J. Martínez-Ferrer, E. Ayguadé, and V. Beltran, “Assessing Saiph, a task-based DSL for high-performance computational fluid dynamics,” Future Generation Computer Systems, vol. 147, pp. 235–250, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X23001759
- M. Jacquelin, M. Araya-Polo, and J. Meng, “Scalable distributed high-order stencil computations,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’22. IEEE Press, 2022. [Online]. Available: https://dl.acm.org/doi/abs/10.5555/3571885.3571924
- H. Ltaief, Y. Hong, L. Wilson, M. Jacquelin, M. Ravasi, and D. E. Keyes, “Scaling the “memory wall” for multi-dimensional seismic processing with algebraic compression on Cerebras CS-2 systems,” 2023. [Online]. Available: http://hdl.handle.net/10754/694388
- J. Virieux, “P-SV wave propagation in heterogeneous media: Velocity-stress finite-difference method,” Geophysics, vol. 51, no. 4, pp. 889–901, 1986. [Online]. Available: https://doi.org/10.1190/1.1442147
- F. Luporini, M. Louboutin, M. Lange, N. Kukreja, rhodrin, G. Bisbas, V. Pandolfo, L. Cavalcante, T. Burgess, G. Gorman, and K. Hester, “devitocodes/devito: v4.7.1,” Aug. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6958070