Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI
Abstract: Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem is made more difficult when supporting collective operations across processes, such as barrier, reduce, scatter, and gather. Some processes may have reached the barrier or other collective operation, while other processes take a long time to reach that same barrier or collective operation. At least two solutions are well known in the literature: (i) draining in-flight network messages and then freezing the network at checkpoint time; and (ii) adding a barrier prior to the collective operation, and either completing the operation or aborting the barrier if not all processes are present. Both solutions suffer from important drawbacks. The code in the first solution must be updated whenever one ports to a newer network. The second solution implies additional barrier-related network traffic prior to each collective operation. This work presents a third solution that avoids both drawbacks. There is no additional barrier-related traffic, and the solution is implemented entirely above the network layer. The work is demonstrated in the context of transparent checkpointing of MPI libraries for parallel computation, where each of the first two solutions has already been used in prior systems and then abandoned due to the aforementioned drawbacks. Experiments demonstrate the low runtime overhead of this new, network-agnostic approach. The approach is also extended to non-blocking collective operations in order to handle overlapping of computation and communication.
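The difficulty described in the abstract, where some processes have entered a collective operation while others have not yet reached it, can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not spell out. It is a hypothetical, single-process simulation that assumes each rank keeps a counter of how many collective calls it has entered on one communicator; the names `lagging_ranks` and `safe_to_checkpoint` are invented for illustration.

```python
from typing import Dict, List


def lagging_ranks(counts: Dict[int, int]) -> List[int]:
    """Return the ranks that have entered fewer collectives than the furthest rank.

    `counts` maps rank -> number of collective calls entered so far on a
    single communicator (a hypothetical per-rank counter, assumed here).
    """
    target = max(counts.values())
    return sorted(rank for rank, c in counts.items() if c < target)


def safe_to_checkpoint(counts: Dict[int, int]) -> bool:
    """A snapshot is consistent with respect to collectives only if no rank lags."""
    return not lagging_ranks(counts)


if __name__ == "__main__":
    # Ranks 0, 1, and 3 have entered their third collective; rank 2 has not.
    counts = {0: 3, 1: 3, 2: 2, 3: 3}
    print(lagging_ranks(counts))       # ranks that must advance first
    print(safe_to_checkpoint(counts))  # whether a snapshot could be taken now
```

In this toy model, a checkpoint request would have to wait until rank 2 also enters its third collective. The sketch only shows the condition that has to hold; it says nothing about how a real system exchanges the counters or at what network cost.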