- The paper introduces the Partial Experts Checkpoint (PEC) mechanism, which reduces checkpoint size by 54.2% with negligible accuracy loss in Sparse MoE training.
- It employs adaptive data-parallel sharding and asynchronous checkpointing with triple-buffering to enhance fault tolerance efficiently.
- Experimental results on an 8-expert GPT-MoE model show a 76.9% reduction in per-rank workload and significant efficiency gains.
MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
Fault tolerance in large-scale neural network training, particularly for Sparse Mixture-of-Experts (MoE) models, poses unique challenges because of these models' size and architectural complexity. This paper introduces the Partial Experts Checkpoint (PEC) mechanism and a corresponding fault-tolerant system designed to address these challenges effectively.
Background and Motivation
As transformer-based LLMs continue to scale, fault tolerance becomes paramount in distributed training systems. Traditional checkpointing, while effective for dense models, struggles with the additional state introduced by sparsely-gated MoE layers. MoE models expand parameter counts while keeping computational demands manageable, but that parameter growth inflates checkpoint volume and calls for tailored solutions to manage it efficiently.
Partial Experts Checkpoint (PEC) Mechanism
The PEC mechanism checkpoints only a subset of experts within each MoE layer at each checkpoint round, shrinking the checkpoint to a size close to that of a comparable dense model. In doing so, it mitigates the storage and time overheads of traditional checkpointing methods.
Figure 1: The saved components of training states in a PEC checkpoint, including the expert part with partial selection (orange) and the non-expert part with data-parallel sharding (green).
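To make the mechanism concrete, here is a minimal Python sketch of a PEC-style save step under a sequential selection policy; the function name, the flat `model_state` dictionary, and the `expert.{id}.` key convention are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a PEC-style save, assuming a flat state dict whose
# expert parameters use keys like "expert.3.w1". Not the paper's code.

def pec_checkpoint(model_state, ckpt_round, num_experts, k):
    """Save all non-expert parameters plus k experts chosen sequentially,
    so every expert is covered once per (num_experts / k) rounds."""
    # The dense (non-expert) part is always saved in full.
    checkpoint = {name: value for name, value in model_state.items()
                  if "expert" not in name}
    # Sequential selection: rotate the window of k experts each round.
    start = (ckpt_round * k) % num_experts
    selected = [(start + i) % num_experts for i in range(k)]
    for e in selected:
        prefix = f"expert.{e}."
        checkpoint.update({name: value for name, value in model_state.items()
                           if name.startswith(prefix)})
    return checkpoint, selected

# Round 0 saves experts [0, 1]; round 1 saves [2, 3]; and so on.
state = {"attn.w": [0.1], "expert.0.w": [0.2], "expert.1.w": [0.3],
         "expert.2.w": [0.4], "expert.3.w": [0.5]}
ckpt, chosen = pec_checkpoint(state, ckpt_round=0, num_experts=4, k=2)
```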
Effects on Model Quality
The introduction of PEC raises a natural concern about its impact on training accuracy. The paper addresses this with the Portion of Lost Tokens (PLT) metric, which quantifies how many token updates are lost to partial checkpointing and thereby serves as a proxy for accuracy loss. The empirical analysis finds negligible accuracy degradation when PLT stays below 7.5%, a loss comparable in character to the dropout mechanisms already built into deep learning frameworks.
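The summary above does not reproduce the metric's formula, so the snippet below is one plausible reading, a hedged sketch only: PLT as the fraction of tokens routed to experts whose latest updates no checkpoint captured, and which a recovery would therefore lose. `tokens_per_expert` and `stale_experts` are assumed inputs.

```python
# Sketch of a Portion-of-Lost-Tokens style computation; the paper's exact
# definition may differ. Counts cover tokens seen since each expert's last save.

def portion_of_lost_tokens(tokens_per_expert, stale_experts, total_tokens):
    """Fraction of tokens whose expert updates would be lost on recovery."""
    lost = sum(tokens_per_expert[e] for e in stale_experts)
    return lost / total_tokens

# Example: two experts are stale since their last checkpoint.
counts = {3: 1200, 7: 900}  # unsaved token updates per stale expert
plt = portion_of_lost_tokens(counts, stale_experts=[3, 7], total_tokens=80000)
print(f"PLT = {plt:.2%}")   # about 2.6%, under the 7.5% threshold noted above
```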
Enhanced Fault-Tolerant System
The system built around PEC adds a series of enhancements that mitigate the impact of faults without degrading model quality. The key components include asynchronous checkpointing with triple-buffering, adaptive data-parallel sharding, and a Dynamic-K strategy that adapts the number of checkpointed experts to the fault scenarios encountered over long training runs.
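The following is a rough sketch of how triple-buffering can keep checkpoint persistence off the training critical path: three host-side buffers rotate between being free, holding a fresh snapshot, and being written out by a background thread. The queue-based hand-off and pickle persistence are assumptions for illustration, not the system's actual design.

```python
import copy
import pickle
import queue
import threading

class TripleBufferCheckpointer:
    """Rotate three buffers among training, snapshotting, and persisting so
    no stage waits on another. A sketch, not the MoC-System implementation."""

    def __init__(self):
        self.free = queue.Queue()
        for _ in range(3):                      # three reusable host buffers
            self.free.put({})
        self.pending = queue.Queue()
        threading.Thread(target=self._persist_loop, daemon=True).start()

    def snapshot(self, model_state, step):
        buf = self.free.get()                   # blocks only if I/O lags far behind
        buf.clear()
        buf.update(copy.deepcopy(model_state))  # stands in for device-to-host copy
        self.pending.put((buf, step))           # hand off; training continues

    def _persist_loop(self):
        while True:                             # background writer thread
            buf, step = self.pending.get()
            with open(f"ckpt_{step}.pkl", "wb") as f:
                pickle.dump(buf, f)             # slow write off the critical path
            self.free.put(buf)                  # recycle the buffer
```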
Figure 2: Schematic of two-level PEC saving and recovery within PEC fault-tolerant system.
Two-Level PEC Strategy
The two-level PEC approach splits checkpointing into snapshot-PEC and persist-PEC stages, so overhead can be managed separately at each phase of the checkpoint process. This split is pivotal to fully overlapping checkpointing with the training workflow, minimizing the residual overhead from checkpoint operations.
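As a minimal sketch of the split, assume snapshot-PEC copies the selected partial-expert state into a host-memory buffer inline with training, while persist-PEC later flushes that buffer to durable storage in the background; the function names, flat state dict, and key convention are illustrative assumptions.

```python
import pickle

def snapshot_pec(model_state, selected_experts, host_buffer):
    """Level 1 (snapshot-PEC): quickly copy non-expert states plus the
    selected experts into host memory, overlapping later training steps."""
    host_buffer.clear()
    for name, value in model_state.items():
        if "expert" not in name or any(name.startswith(f"expert.{e}.")
                                       for e in selected_experts):
            host_buffer[name] = value

def persist_pec(host_buffer, path):
    """Level 2 (persist-PEC): write the host-side snapshot to durable
    storage in the background, hiding I/O behind the training workflow."""
    with open(path, "wb") as f:
        pickle.dump(host_buffer, f)
```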
Experimental Results
The paper's experimental evaluation, conducted using an 8-expert GPT-MoE model, demonstrates significant efficiency gains. A notable 54.2% reduction in checkpoint size is reported without sacrificing model fidelity. Furthermore, a 76.9% reduction in the per-rank checkpoint workload showcases the practical benefits of the PEC system.
Figure 3: Experimental results of training GPT-MoE with PEC using sequential expert selection.
Comparison with Baseline
The adaptive and sequential expert-selection strategies proposed under PEC outperform conventional checkpointing, substantially reducing duplicated data storage and improving overall training efficiency. Load-aware expert selection further fine-tunes the trade-off between resource usage and fault-tolerance coverage.
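One plausible form of load-aware selection, sketched below under the assumption that the scheduler tracks unsaved token counts per expert: checkpoint the k experts whose loss would discard the most token updates. The function and input names are hypothetical.

```python
# Hedged sketch of load-aware expert selection: prefer checkpointing the
# experts that processed the most tokens since their last save, since losing
# them would discard the most updates. `token_counts` is an assumed input.

def select_experts_load_aware(token_counts, k):
    """Return the k expert ids with the largest unsaved token load."""
    return sorted(token_counts, key=token_counts.get, reverse=True)[:k]

# Example: with k=2, the two busiest experts are chosen for this round.
counts = {0: 500, 1: 2400, 2: 900, 3: 1700}
print(select_experts_load_aware(counts, k=2))  # [1, 3]
```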
Conclusion
The PEC mechanism and its surrounding system mark a significant advance in efficient fault-tolerant training for MoE models. By addressing the scalability challenges inherent to sparse models, PEC strengthens the robustness and efficiency of AI infrastructure handling large-scale training. Future work aims to refine the balance between operational efficiency and model fidelity, drawing on advances in gradient compression and optimization strategies. Such continued innovation is critical to supporting the scaling demands of next-generation AI applications and infrastructure.