- The paper presents a unified checkpointing system that offers automatic online resharding and a decoupled storage architecture for efficient LLM development.
- It introduces asynchronous tensor merging and zero redundant loading to handle irregular tensor sharding and reduce I/O overhead.
- Experimental results show up to a 529× reduction in checkpoint stalls and up to 3.5× faster loading, demonstrating the system's efficiency across various frameworks.
ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Introduction
The paper presents ByteCheckpoint, a PyTorch-native checkpointing solution designed to address the complex checkpointing needs of LLMs. To accommodate the scale and diversity of LLM training practices, ByteCheckpoint provides automatic online checkpoint resharding and efficiently supports multiple training frameworks and storage backends.
Motivation and Challenges
Checkpointing LLMs is critical for fault tolerance in environments where models span thousands of GPUs. Traditional systems often assume the same parallelism configuration at save and load time, which becomes a limitation when checkpoints must be adapted to changing GPU availability and task requirements. ByteCheckpoint removes this restriction by integrating seamlessly across parallelism strategies, including tensor parallelism (TP), data parallelism (DP), and pipeline parallelism (PP), and across frameworks such as FSDP and Megatron-LM.
Figure 1: Various checkpoint resharding requirements in real-world LLM production. Users may use different parallelism strategies and training frameworks to save/load checkpoints for their tasks. We only show GPU states here for simplicity.
System Design
ByteCheckpoint's architecture is centered on a disaggregated storage system that separates data and metadata. This decoupling allows it to efficiently manage and transform checkpoints regardless of the training framework.
Figure 2: Storage architecture of ByteCheckpoint. In this example, distributed checkpoints are saved with four training processes.
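To make the decoupling concrete, here is a minimal sketch of a checkpoint layout in which tensor payloads and metadata live in separate files. The `ShardMeta` structure, file names, and JSON manifest format are illustrative assumptions, not ByteCheckpoint's actual on-disk format.

```python
# Minimal sketch of a decoupled checkpoint layout (illustrative only).
# Tensor payloads go into per-rank data blobs; a small metadata manifest
# records, for every logical tensor, where each shard lives and which
# slice of the global tensor it covers.
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ShardMeta:
    tensor_name: str         # fully qualified parameter/optimizer-state name
    global_shape: List[int]  # shape of the unsharded (logical) tensor
    offsets: List[int]       # start index of this shard in each dimension
    lengths: List[int]       # extent of this shard in each dimension
    data_file: str           # blob file that holds the bytes
    byte_offset: int         # position of the shard inside that blob
    num_bytes: int

def write_metadata(shards: List[ShardMeta], path: str) -> None:
    """Persist the manifest separately from the tensor data blobs."""
    with open(path, "w") as f:
        json.dump([asdict(s) for s in shards], f, indent=2)

# Example: one 1024x1024 weight split row-wise across two ranks.
shards = [
    ShardMeta("model.layers.0.weight", [1024, 1024], [0, 0],   [512, 1024],
              "data_rank0.bin", 0, 512 * 1024 * 4),
    ShardMeta("model.layers.0.weight", [1024, 1024], [512, 0], [512, 1024],
              "data_rank1.bin", 0, 512 * 1024 * 4),
]
write_metadata(shards, "checkpoint_metadata.json")
```

Because the manifest fully describes every shard, a loader can plan reads against the metadata alone and never needs to open data blobs it does not need.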
Automatic Online Resharding
The system supports online resharding, where checkpoints saved under one configuration can be loaded into a different parallel configuration without the need for manual intervention. This flexibility is particularly beneficial for tasks that necessitate frequent changes in GPU allocation.
Figure 3: An illustration of automatic online resharding. Assume that each tensor shard is retained in its original shape before saving.
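The core of such resharding is an overlap computation: each shard in the new layout only needs the intersecting slices of the shards that were saved. The sketch below works in one dimension with hypothetical (offset, length) descriptors; it is a simplified stand-in for metadata-driven read planning, not ByteCheckpoint's implementation.

```python
# Sketch: which saved shards (and which sub-ranges) does a new shard need?
# Shards are described by (offset, length) along the sharded dimension.
from typing import List, Tuple

def plan_reshard(saved: List[Tuple[int, int]],
                 want_offset: int, want_length: int) -> List[Tuple[int, int, int]]:
    """Return (saved_shard_index, start_within_saved_shard, copy_length)
    for every saved shard that overlaps the requested range."""
    plan = []
    want_end = want_offset + want_length
    for i, (off, length) in enumerate(saved):
        lo = max(off, want_offset)
        hi = min(off + length, want_end)
        if lo < hi:                      # non-empty overlap
            plan.append((i, lo - off, hi - lo))
    return plan

# Saved with TP=2 (two shards of 512 rows); reloaded with TP=4
# (each new rank wants 256 rows). New rank 1 wants rows [256, 512).
print(plan_reshard([(0, 512), (512, 512)], 256, 256))
# -> [(0, 256, 256)]  : read 256 rows starting at row 256 of saved shard 0
```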
Addressing Irregular Tensor Sharding
ByteCheckpoint introduces asynchronous tensor merging techniques to handle cases where tensors are irregularly sharded, such as in certain optimizer states in Megatron-LM and veScale. This strategy reduces communication overhead and supports efficient parallelism remapping.
Figure 4: Irregular tensor sharding in the distributed optimizers of Megatron-LM.
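The sketch below illustrates only the overlapping idea behind asynchronous merging: irregular, unequal-length shards are stitched back together off the critical path while other save work proceeds. It is a single-process stand-in; in the real system the shards live on different ranks and the merge involves communication.

```python
# Sketch: overlap the merging of irregularly sized shards with other
# save-time work, so the merge does not stall the critical path.
# Single-process stand-in for cross-rank merging (illustrative only).
from concurrent.futures import ThreadPoolExecutor
import torch

def merge_irregular_shards(shards):
    """Concatenate flat shards of unequal length back into one tensor."""
    return torch.cat([s.flatten() for s in shards])

# Irregular sharding: a flattened optimizer state split at arbitrary points.
shards = [torch.randn(1000), torch.randn(37), torch.randn(1011)]

with ThreadPoolExecutor(max_workers=1) as pool:
    merge_future = pool.submit(merge_irregular_shards, shards)  # runs in background
    # ... the main thread could serialize regular, already-aligned tensors here ...
    merged = merge_future.result()  # join before writing the merged tensor

assert merged.numel() == sum(s.numel() for s in shards)
```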
Workflow and API Design
ByteCheckpoint distinguishes itself with a simplified API that abstracts the complexities of checkpoint management. Users interact with two main functions: bytecheckpoint.save() and bytecheckpoint.load(), which handle checkpointing details internally.
Figure 5: Workflow of ByteCheckpoint.
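A hypothetical usage sketch of these two entry points is shown below. Only the function names come from the paper; the argument names, checkpoint path, and the toy model are assumptions made for illustration.

```python
# Illustrative usage of the two entry points named in the paper.
# Only bytecheckpoint.save()/load() come from the text; the argument
# names below are assumptions for the sake of the example.
import torch
import bytecheckpoint

model = torch.nn.Linear(1024, 1024)               # stand-in for a sharded LLM
optimizer = torch.optim.AdamW(model.parameters())

# Save: the library captures sharded model/optimizer state and writes
# tensor data plus the metadata manifest to the storage backend.
bytecheckpoint.save("ckpt/step_1000", model=model, optimizer=optimizer)

# Load: may run under a different parallelism configuration; resharding
# is performed online, without an offline conversion step.
bytecheckpoint.load("ckpt/step_1000", model=model, optimizer=optimizer)
```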
Beyond resharding, the paper details further optimization techniques, such as zero redundant loading, that cut the I/O cost of saving and loading checkpoints.
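As a rough illustration of the zero redundant loading idea, the sketch below partitions shard reads across ranks so that each file is fetched from storage exactly once; the round-robin assignment is an assumption for illustration, not the system's actual policy.

```python
# Sketch of the idea behind zero redundant loading: instead of every rank
# reading every shard it needs from storage, reads are partitioned across
# ranks and the results are exchanged over the (faster) interconnect.
from typing import Dict, List

def assign_reads(shard_files: List[str], world_size: int) -> Dict[int, List[str]]:
    """Give each rank a disjoint subset of shard files to read exactly once."""
    assignment: Dict[int, List[str]] = {r: [] for r in range(world_size)}
    for i, f in enumerate(shard_files):
        assignment[i % world_size].append(f)
    return assignment

files = [f"data_rank{i}.bin" for i in range(8)]
print(assign_reads(files, world_size=4))
# Each file is read by exactly one rank; ranks then broadcast the tensors
# they read to the peers that also need them, avoiding duplicate storage I/O.
```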
Experimental Results
Experiments show that ByteCheckpoint reduces checkpoint stalls by up to 529.22× and speeds up checkpoint loading by up to 3.51× compared with baseline systems. These results underscore its efficiency and adaptability in real-world production environments.
Conclusion
ByteCheckpoint sets a new standard for checkpointing systems in LLM development by offering efficient, scalable solutions to the challenges of saving and loading large distributed models. Its architecture and optimization strategies pave the way for broader adoption across diverse AI tasks and underscore its role in advancing LLM training practices.