Portable, heterogeneous ensemble workflows at scale using libEnsemble (2403.03709v3)
Abstract: libEnsemble is a Python-based toolkit for running dynamic ensembles, developed as part of the DOE Exascale Computing Project. The toolkit uses a unique generator–simulator–allocator paradigm, in which generators produce inputs for simulators, simulators evaluate those inputs, and allocators decide whether and when a simulator or generator should be called. The generator steers the ensemble based on simulation results. Generators may, for example, apply methods for numerical optimization, machine learning, or statistical calibration. libEnsemble communicates between a manager and workers. We give an overview of the unique characteristics of libEnsemble as well as its current and potential interoperability with other packages in the workflow ecosystem. We highlight libEnsemble's dynamic resource features: libEnsemble can detect system resources, such as available nodes, cores, and GPUs, and assign them in a portable way. These features allow users to specify the number of processors and GPUs required for each simulation, and resources are then assigned automatically on a wide range of systems, including Frontier, Aurora, and Perlmutter. Such ensembles can include multiple simulation types, some using GPUs and others using only CPUs, sharing nodes for maximum efficiency. We also describe the benefits of libEnsemble's generator–simulator coupling, which gives the user a simple way to cancel, and portably kill, running simulations based on models that are updated with intermediate simulation output. We demonstrate libEnsemble's capabilities, scalability, and scientific impact via a Gaussian process surrogate training problem for the longitudinal density profile at the exit of a plasma accelerator stage. The study uses gpCAM for the surrogate model and employs either Wake-T or WarpX simulations, highlighting efficient use of resources that can easily extend to exascale.
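The generator, simulator, and allocator roles described in the abstract can be illustrated with a toy loop. This is a minimal sketch in plain Python and does not use libEnsemble's API: the function names (`generator`, `simulator`, `allocator`, `run_ensemble`), the quadratic test function, and the steering rule (sampling near the best point so far) are all assumptions made for illustration. It also runs serially, whereas libEnsemble distributes this work between a manager and workers.

```python
import random
from collections import deque


def generator(history, batch_size, rng):
    """Hypothetical generator: propose inputs, steering toward the best result so far."""
    if not history:
        # No results yet: sample the search interval uniformly.
        return [rng.uniform(-5.0, 5.0) for _ in range(batch_size)]
    best_x, _ = min(history, key=lambda pair: pair[1])
    # Sample around the current best input.
    return [best_x + rng.gauss(0.0, 0.5) for _ in range(batch_size)]


def simulator(x):
    """Hypothetical simulator: a cheap quadratic standing in for an expensive run."""
    return (x - 1.0) ** 2


def allocator(pending, n_evaluated, max_evals):
    """Hypothetical allocator: decide whether to call a simulator, a generator, or stop."""
    if n_evaluated >= max_evals:
        return "stop"
    return "sim" if pending else "gen"


def run_ensemble(max_evals=40, batch_size=4, seed=0):
    rng = random.Random(seed)
    pending = deque()  # inputs produced by the generator, not yet simulated
    history = []       # (input, output) pairs from completed simulations
    while True:
        action = allocator(pending, len(history), max_evals)
        if action == "stop":
            break
        if action == "gen":
            pending.extend(generator(history, batch_size, rng))
        else:
            x = pending.popleft()
            history.append((x, simulator(x)))
    return history


history = run_ensemble()
best_x, best_f = min(history, key=lambda pair: pair[1])
print(len(history), best_x, best_f)
```

Keeping the allocation decision in its own function mirrors the separation the abstract describes: the policy for when to generate and when to simulate can change without touching either the generator or the simulator.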