
System-level Scalable Checkpoint-Restart for Petascale Computing (1607.07995v2)

Published 27 Jul 2016 in cs.DC and cs.OS

Abstract: Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.
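To put the headline figure in perspective, the following is a rough back-of-the-envelope sketch (not taken from the paper) of the aggregate write rate implied by checkpointing 38 TB in 11 minutes across 32,752 processes. It assumes decimal terabytes, an uncompressed checkpoint image equal in size to the memory footprint, and that the full 11 minutes is spent writing. The function name `checkpoint_throughput` is hypothetical.

```python
def checkpoint_throughput(footprint_bytes, seconds, processes):
    """Return (aggregate bytes/s, bytes per process, bytes/s per process)."""
    aggregate_rate = footprint_bytes / seconds
    per_process_bytes = footprint_bytes / processes
    per_process_rate = per_process_bytes / seconds
    return aggregate_rate, per_process_bytes, per_process_rate


TB = 10**12  # assumption: decimal terabytes (the abstract does not say TB vs TiB)

# Figures from the abstract: 38 TB footprint, 11 minutes, 32,752 MPI processes.
aggregate, per_proc, per_proc_rate = checkpoint_throughput(
    footprint_bytes=38 * TB,
    seconds=11 * 60,
    processes=32_752,
)

print(f"aggregate write rate:   {aggregate / 1e9:.1f} GB/s")
print(f"footprint per process:  {per_proc / 1e9:.2f} GB")
print(f"write rate per process: {per_proc_rate / 1e6:.2f} MB/s")
```

Under these assumptions the storage system would need to sustain an aggregate rate in the tens of GB/s, which is why the abstract's extrapolation to SSD-based storage matters for exascale.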

Authors (8)
  1. Jiajun Cao
  2. Kapil Arya
  3. Rohan Garg
  4. Shawn Matott
  5. Dhabaleswar K. Panda
  6. Hari Subramoni
  7. Jérôme Vienne
  8. Gene Cooperman
Citations (31)
