A Study of Checkpointing in Large Scale Training of Deep Neural Networks (2012.00825v2)

Published 1 Dec 2020 in cs.DC

Abstract: Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training by DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
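As a point of reference for what checkpoint-restart means in a DL training context, the sketch below shows a minimal PyTorch-style checkpoint save and restore. It is an illustrative assumption, not the authors' experimental code; the model, optimizer, and file path are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: a toy model and optimizer stand in for a real training job.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def save_checkpoint(path: str, epoch: int) -> None:
    # A checkpoint typically bundles model weights, optimizer state,
    # and progress metadata so training can resume after a failure.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def restore_checkpoint(path: str) -> int:
    # Restart: reload the saved state and return the epoch to resume from.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["epoch"] + 1

save_checkpoint("checkpoint.pt", epoch=5)
start_epoch = restore_checkpoint("checkpoint.pt")
```

The cost of serializing and writing such state to the file system at scale, and the differences in how each framework implements it, are what the paper's evaluation measures.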

Authors (5)
  1. Elvis Rojas (1 paper)
  2. Albert Njoroge Kahira (4 papers)
  3. Esteban Meneses (5 papers)
  4. Leonardo Bautista Gomez (5 papers)
  5. Rosa M Badia (35 papers)
Citations (20)
