ExaWorks Software Development Kit: A Robust and Scalable Collection of Interoperable Workflow Technologies (2407.16646v1)
Abstract: Scientific discovery increasingly requires executing heterogeneous scientific workflows on high-performance computing (HPC) platforms. Heterogeneous workflows contain different types of tasks (e.g., simulation, analysis, and learning) that need to be mapped, scheduled, and launched on different computing resources. That requires a software stack that enables users to code their workflows and automate resource management and workflow execution. Currently, there are many workflow technologies with diverse levels of robustness and capabilities, and users face difficult choices among software that can effectively and efficiently support their use cases on HPC machines, especially when considering the latest exascale platforms. We contributed to addressing this issue by developing the ExaWorks Software Development Kit (SDK). The SDK is a curated collection of workflow technologies engineered following current best practices and specifically designed to work on HPC platforms. We present our experience with (1) curating those technologies, (2) integrating them to provide users with new capabilities, (3) developing a continuous integration platform to test the SDK on DOE HPC platforms, (4) designing a dashboard to publish the results of those tests, and (5) devising an innovative documentation platform to help users use those technologies. Our experience details the requirements and the best practices needed to curate workflow technologies, and it also serves as a blueprint for the capabilities and services that DOE will have to offer to support a variety of heterogeneous scientific workflows on the newly available exascale HPC platforms.