- The paper introduces Universal Checkpointing (UCP) which efficiently creates checkpoints and flexibly resumes training across diverse hardware setups.
- UCP pairs a distributed representation for saving with a consolidated representation for loading, keeping checkpoint creation fast and transformation overhead low.
- Integrated into DeepSpeed, UCP maintains consistent training loss and enhances fault tolerance by adapting to varying parallelism strategies.
Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
The paper "Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training" addresses a critical limitation of distributed deep learning training systems: the inefficiency and inflexibility of existing checkpointing mechanisms. It introduces Universal Checkpointing (UCP), a technique that provides efficient checkpoint creation together with the flexibility to resume training under different parallelism strategies and hardware configurations. This is especially pertinent given the prevalence of hardware failures and fluctuating hardware availability over the prolonged training periods of large language models (LLMs).
Key Contributions
- Optimal Representation Selection: The core idea of UCP is to choose the best representation for each phase of the checkpointing life cycle: a distributed representation for saving and a consolidated representation for loading. This is realized through two key mechanisms (illustrated by the sketches after this list):
  - Universal Checkpoint Format: consolidated representations of each model parameter, alongside metadata for mapping parameter fragments onto arbitrary model-parallelism configurations.
  - Universal Checkpoint Language: a specification language that describes how distributed checkpoints are converted into the universal format, enabling interoperability across diverse parallelism strategies.
- Implementation in DeepSpeed: UCP is integrated into DeepSpeed, a widely used deep learning optimization library. The implementation supports flexible checkpoint transformation across various parallelism techniques, enables elastic resource management, and provides cross-framework compatibility with platforms such as HuggingFace and PyTorch Lightning.
- Transformation Operations and Patterns: UCP defines a set of transformation operations and sharding patterns that systematically convert distributed checkpoints into unified, universal checkpoints (sketched below), with minimal overhead and no degradation in model quality after training resumes.
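To make the save-side mechanism concrete, the sketch below shows how per-rank checkpoint fragments of a single parameter might be unioned into a consolidated, parallelism-agnostic tensor and written out with re-partitioning metadata. The file layout, the fragment fields ("offset", "numel"), and the function names are illustrative assumptions rather than DeepSpeed's actual API.

```python
# Sketch of the "union" step: merge per-rank fragments of one parameter into a
# consolidated (universal) representation. File layout and metadata are hypothetical.
import math
import torch

def consolidate_parameter(fragment_files, full_shape):
    """Merge per-rank fragments of a single parameter into one consolidated tensor."""
    flat = torch.empty(math.prod(full_shape), dtype=torch.float32)
    for path in fragment_files:
        frag = torch.load(path)                    # one rank's fragment (assumed format)
        start, length = frag["offset"], frag["numel"]
        flat[start:start + length] = frag["data"]  # place the fragment at its global offset
    return flat.view(full_shape)                   # parallelism-agnostic representation

def save_universal(name, tensor, out_dir):
    """Write one consolidated parameter plus metadata used later for re-partitioning."""
    torch.save({"param": tensor, "shape": tuple(tensor.shape)}, f"{out_dir}/{name}.pt")
```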
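On the load side, each rank of the new parallelism configuration would read a consolidated parameter and extract only its own slice, guided by a declarative sharding pattern in the spirit of the Universal Checkpoint Language. The pattern names, parameter names, and loader below are likewise hypothetical sketches, not the library's real interface.

```python
# Sketch of the load side: a rank in the NEW parallelism configuration extracts only
# its slice of a consolidated parameter, driven by a declarative sharding pattern.
# Pattern names, parameter names, and the loader are illustrative assumptions.
import torch

# Hypothetical per-parameter patterns in the spirit of the Universal Checkpoint Language.
PATTERNS = {
    "embed.weight":      {"pattern": "shard", "dim": 0},  # split rows across TP ranks
    "attn.qkv.weight":   {"pattern": "shard", "dim": 0},
    "final_norm.weight": {"pattern": "replicate"},        # full copy on every rank
}

def load_for_rank(name, ucp_dir, tp_rank, tp_world_size):
    """Return the slice of a consolidated parameter owned by tensor-parallel rank `tp_rank`."""
    record = torch.load(f"{ucp_dir}/{name}.pt")           # consolidated tensor + metadata
    param, spec = record["param"], PATTERNS[name]
    if spec["pattern"] == "replicate":
        return param.clone()                              # replicated parameters load in full
    chunk = param.shape[spec["dim"]] // tp_world_size     # assume even divisibility
    return param.narrow(spec["dim"], tp_rank * chunk, chunk).clone()
```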
Evaluation and Results
The evaluation of UCP covers several state-of-the-art model architectures, including Megatron-LM GPT, LLaMA, and a sparse mixture-of-experts (MoE) variant. The main findings highlight:
- Consistent Training Loss: When transforming and resuming checkpoints with different parallelism strategies and hardware configurations, the training loss remains consistent with the original training runs. This was validated across various parallelism configurations and hardware setups, demonstrating UCP's robustness and reliability in maintaining model convergence.
- Negligible Overhead: UCP adds only minimal loading overhead compared to standard distributed training. Benchmarking shows that the overhead of using UCP, including the transformation step, ranges from 1.14x to 1.37x, which is negligible in the context of end-to-end training times.
Practical and Theoretical Implications
Practical Implications: UCP substantially enhances fault tolerance and resource flexibility in large-scale distributed training:
- Resilience to Hardware Failures: Training can continue on remaining healthy hardware without waiting for failed nodes, reducing resource waste and shortening training times.
- Elastic Capacity Utilization: UCP allows for dynamic scaling of training processes, making opportunistic use of available hardware when resources fluctuate.
Theoretical Implications: This work provides a structured approach to a long-standing problem in distributed training, potentially informing the development of more advanced and adaptive checkpointing mechanisms. The partition-based approach to representing and managing checkpoints could inspire further research into distributed systems' fault tolerance and dynamic resource management.
Future Developments
Future work on UCP may focus on:
- Extending the list of supported patterns to accommodate emerging parallelism strategies.
- Enhancing the efficiency of the UCP conversion process to further minimize overhead.
- Improving the integration and interoperability of UCP with additional distributed training frameworks and hardware accelerators.
Overall, Universal Checkpointing stands as a significant advancement in the field of distributed deep learning training. Its ability to adapt to different training configurations without compromising on training performance marks a crucial step towards more resilient and flexible deep learning systems.