Auto Tuning for OpenMP Dynamic Scheduling applied to FWI (2402.16728v1)
Abstract: Because Full Waveform Inversion (FWI) processes a massive amount of data, its execution requires substantial time and computational resources, which restricts it to large-scale computer systems such as supercomputers. Techniques such as FWI adapt well to parallel computing and can be parallelized on shared-memory systems with the OpenMP application programming interface (API). OpenMP manages parallel tasks through its loop schedulers. The dynamic scheduler stands out because it hands chunks of a predefined, fixed number of loop iterations (the chunk size) to idle processing cores at runtime. This makes it better suited to FWI, where data processing can be irregular. However, the relationship between chunk size and runtime is not known in advance. Optimization techniques can employ meta-heuristics to explore the parameter search space without testing every possible solution. Here, we propose a strategy that uses Parameter Auto Tuning for Shared Memory Algorithms (PATSMA), with Coupled Simulated Annealing (CSA) as its optimization method, to automatically adjust the chunk size for the dynamic scheduling of wave propagation, one of the most expensive steps in FWI. Testing each candidate chunk size on the complete FWI is impractical. Our approach therefore runs PATSMA with an objective function equal to the runtime of the first time iteration of the first seismic shot of the first FWI iteration. The resulting chunk size is then used in all wave propagations of the FWI. We measured the runtime of an FWI using the proposed autotuning, varying the problem size and running on different computational environments, including supercomputers and cloud computing instances. The results show that the proposed autotuning reduces FWI runtime by up to 70.46% compared to the standard OpenMP schedulers.
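The trade-off that the chunk-size autotuning explores can be sketched with a small model. The Python code below is neither OpenMP nor PATSMA; it makes several simplifying assumptions:

- `dynamic_schedule_time` is a hypothetical discrete-event model of a loop run under OpenMP's `schedule(dynamic, chunk)`. Consecutive chunks go to whichever thread becomes idle first, and each hand-off pays an assumed fixed `dispatch_overhead`.
- `anneal_chunk` is plain single-chain simulated annealing. It is a simpler stand-in for the coupled variant (CSA) that PATSMA uses.
- The exponential per-iteration costs, the thread count, and the overhead value are illustrative assumptions, not measurements.

In a real code, `objective` would instead time one wave-propagation time step whose loop carries `#pragma omp parallel for schedule(dynamic, chunk)`.

```python
import heapq
import math
import random


def dynamic_schedule_time(costs, chunk, n_threads, dispatch_overhead):
    """Model the runtime of a loop under schedule(dynamic, chunk).

    Iterations are cut into consecutive chunks of `chunk` iterations; each
    chunk goes to whichever thread becomes idle first. Every hand-off also
    costs `dispatch_overhead` (an assumed stand-in for runtime bookkeeping).
    Returns the time at which the last thread finishes.
    """
    idle_at = [0.0] * n_threads  # min-heap of times at which threads go idle
    for start in range(0, len(costs), chunk):
        work = sum(costs[start:start + chunk]) + dispatch_overhead
        t = heapq.heappop(idle_at)           # first idle thread grabs the chunk
        heapq.heappush(idle_at, t + work)
    return max(idle_at)


def anneal_chunk(objective, lo, hi, steps=100, t0=0.1, seed=0):
    """Plain simulated annealing over integer chunk sizes in [lo, hi].

    `objective(chunk)` returns a runtime; in the approach described above it
    would be the measured runtime of the first time iteration.
    """
    rng = random.Random(seed)
    cur = rng.randint(lo, hi)
    cur_cost = objective(cur)
    best, best_cost = cur, cur_cost
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-12  # linear cooling schedule
        # Neighbour: scale the chunk by a random factor in [1/2, 2], then clamp.
        cand = min(hi, max(lo, round(cur * 2.0 ** rng.uniform(-1.0, 1.0))))
        cand_cost = objective(cand)
        delta = (cand_cost - cur_cost) / cur_cost  # relative change in runtime
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best, best_cost


# Toy irregular workload (assumption): per-iteration costs ~ Exponential(1).
rng = random.Random(42)
costs = [rng.expovariate(1.0) for _ in range(4096)]
n_threads = 8


def objective(chunk):
    return dynamic_schedule_time(costs, chunk, n_threads, dispatch_overhead=0.2)


best, best_cost = anneal_chunk(objective, 1, len(costs) // n_threads)
print(f"chunk={best}  modelled time={best_cost:.1f}  (chunk=1: {objective(1):.1f})")
```

Under these assumptions, very small chunks pay the hand-off cost many times. Very large chunks leave some threads idle near the end of the loop. The modelled runtime is therefore lowest at an intermediate chunk size. The annealer only needs objective evaluations to approach that minimum, which is why a cheap proxy such as one timed time step can stand in for the full run.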