A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (2303.06318v2)

Published 11 Mar 2023 in cs.LG, cs.AI, cs.DC, and cs.PF

Abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.

Overview of DeepSpeed-TED: Hybrid Parallelism in MoE Models

The paper explores the challenges of training Mixture-of-Experts (MoE) models, which have gained traction because they increase parameter count without a proportionate increase in computational cost. Since existing distributed frameworks fall short when the base model of an MoE structure is large, the authors present DeepSpeed-TED, which integrates three dimensions of parallelism (tensor, expert, and data) and thereby enables training with significantly larger base models than prior solutions.

Core Contributions

  1. Three-Dimensional Hybrid Parallelism: The paper introduces DeepSpeed-TED, which combines tensor parallelism from Megatron-LM, expert parallelism from DeepSpeed-MoE, and data parallelism from ZeRO. This integration permits the training of MoE models with base models that are 4-8 times larger than those supported by current state-of-the-art frameworks.
  2. Memory and Communication Optimizations: The framework addresses two critical bottlenecks:
    • Memory Usage: By implementing a tiled optimizer, the solution alleviates memory spikes during the optimizer step, allowing models to train successfully even as the number of experts and the base model size grow substantially (a sketch of the tiling idea appears after this list).
    • Communication Overhead: To mitigate increased communication costs, the authors propose Duplicate Token Dropping (DTD) and Communication-aware Activation Checkpointing (CAC). These optimizations reduce unnecessary data movement and lower collective communication times, respectively.
  3. Empirical Evaluation: The framework's efficacy is underscored through extensive experiments on Summit and ThetaGPU supercomputers. Results indicate significant improvements in hardware efficiency and training speed, particularly when applying DTD and CAC.
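
To make the tiled-optimizer idea concrete, the sketch below applies an Adam-style update one tile of a flat parameter buffer at a time, so only a tile-sized fp32 temporary is live at any moment rather than a full-size copy. The function name, tensor layout, hyperparameters, and tile size are illustrative assumptions, not DeepSpeed-TED's actual implementation.

```python
import torch

def tiled_adam_step(params_fp16, master_fp32, grads_fp16,
                    exp_avg, exp_avg_sq, step,
                    lr=1e-4, betas=(0.9, 0.999), eps=1e-8,
                    tile_size=2**26):
    """Apply an Adam update tile by tile over flat 1-D buffers.

    Only one tile's worth of fp32 gradient (plus the small temporaries
    created by the update) is materialized at a time, which caps the
    peak memory of the optimizer step.
    """
    beta1, beta2 = betas
    n = master_fp32.numel()
    for start in range(0, n, tile_size):
        end = min(start + tile_size, n)
        g = grads_fp16[start:end].float()   # tile-sized fp32 temporary
        m = exp_avg[start:end]              # views: updates write through
        v = exp_avg_sq[start:end]
        p = master_fp32[start:end]

        m.mul_(beta1).add_(g, alpha=1 - beta1)
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
        m_hat = m / (1 - beta1 ** step)
        v_hat = v / (1 - beta2 ** step)
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)

        # Cast the updated tile back into the fp16 model weights.
        params_fp16[start:end].copy_(p)
```

Because each slice is a view into the underlying buffer, the in-place updates write directly to the master weights and optimizer states; the tile size trades a small amount of kernel-launch overhead for a bounded memory peak.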

Technical Approach

The paper applies distinct dimensions of parallelism to different components of the MoE model (a minimal sketch of how the corresponding process groups can be composed follows this list):

  • Tensor Parallelism: Partitioning the computation within each layer across GPUs so that large base models remain tractable.
  • Expert Parallelism: Distributing experts across GPUs, where each expert's computation proceeds independently (embarrassingly parallel) once tokens are routed to it.
  • Data Parallelism: Utilizing ZeRO to efficiently shard optimizer states, further decreasing memory consumption.
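
The sketch below shows one way these three process-group dimensions could be composed with PyTorch's torch.distributed. The rank ordering (tensor-parallel fastest, then expert-parallel, then data-parallel) and the function name are assumptions for illustration; DeepSpeed-TED's internal group layout may differ, for instance because expert and non-expert parameters can use different data-parallel groups.

```python
import torch.distributed as dist

def build_3d_groups(world_size, tp_size, ep_size):
    """Create tensor-, expert-, and data-parallel process groups.

    Assumes ranks are ordered tensor-parallel fastest, then
    expert-parallel, then data-parallel (a hypothetical layout).
    """
    assert world_size % (tp_size * ep_size) == 0
    dp_size = world_size // (tp_size * ep_size)
    rank = dist.get_rank()
    tp_group = ep_group = dp_group = None

    def flat(dp, ep, tp):
        # Global rank of coordinates (dp, ep, tp) under the assumed layout.
        return (dp * ep_size + ep) * tp_size + tp

    # Every rank must call new_group() for every group, in the same order.
    for dp in range(dp_size):
        for ep in range(ep_size):
            ranks = [flat(dp, ep, tp) for tp in range(tp_size)]
            g = dist.new_group(ranks)        # tensor-parallel group
            if rank in ranks:
                tp_group = g
    for dp in range(dp_size):
        for tp in range(tp_size):
            ranks = [flat(dp, ep, tp) for ep in range(ep_size)]
            g = dist.new_group(ranks)        # expert-parallel group
            if rank in ranks:
                ep_group = g
    for ep in range(ep_size):
        for tp in range(tp_size):
            ranks = [flat(dp, ep, tp) for dp in range(dp_size)]
            g = dist.new_group(ranks)        # data-parallel (ZeRO) group
            if rank in ranks:
                dp_group = g
    return tp_group, ep_group, dp_group
```

With world_size = tp_size * ep_size * dp_size, every rank belongs to exactly one group of each kind: tensor-parallel collectives shard the computation inside a layer, expert-parallel all-to-alls route tokens to experts, and the data-parallel group is where ZeRO shards optimizer states.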

The new partitioning approach resolves two observed bottlenecks—inefficient memory usage during optimizer operations and excessive communication across GPUs—both exacerbated by increasing expert counts and base model sizes.
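
To give a flavor of the Duplicate Token Dropping idea, the sketch below lets each tensor-parallel rank forward only a disjoint slice of the replicated token batch through the expert all-to-all, then restores the full batch with an all-gather afterwards. The function names and the assumption that the batch splits evenly across the tensor-parallel group are illustrative; this is not the paper's exact implementation.

```python
import torch
import torch.distributed as dist

def drop_duplicate_tokens(tokens, tp_group):
    """Keep only this tensor-parallel rank's slice of the replicated batch."""
    tp_size = dist.get_world_size(group=tp_group)
    tp_rank = dist.get_rank(group=tp_group)
    chunk = tokens.size(0) // tp_size        # assumes an even split
    return tokens[tp_rank * chunk:(tp_rank + 1) * chunk]

def restore_duplicate_tokens(local_out, tp_group):
    """All-gather expert outputs so every tensor-parallel rank sees the full batch."""
    tp_size = dist.get_world_size(group=tp_group)
    gathered = [torch.empty_like(local_out) for _ in range(tp_size)]
    dist.all_gather(gathered, local_out, group=tp_group)
    return torch.cat(gathered, dim=0)
```

The saving comes from the fact that, without dropping, every rank in a tensor-parallel group would push identical copies of the same tokens through the expensive all-to-all.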

Results and Implications

The paper demonstrates that DeepSpeed-TED supports MoE models with several billion parameters, achieving significant efficiency improvements:

  • Speedups of up to 29% in communication-intensive operations.
  • Support for MoE models with base models 4 to 8x larger than previously feasible.

Such advances mean that larger, high-quality MoE models can be trained without a prohibitive increase in computational resources. In practice, this widens the range of model scales that can be explored, with implications for model architecture design and efficiency.

Future Directions

In light of these enhancements, future research could explore integrating pipeline parallelism to extend the framework's scalability. There is also room to further tune communication routines for diverse network architectures and to reduce steady-state memory requirements by eliminating additional data redundancy.

DeepSpeed-TED represents a significant step in distributed deep learning, offering a robust solution for training massive MoE models while keeping the computational and memory demands of supercomputing-scale training manageable.

Authors (6)
  1. Siddharth Singh (42 papers)
  2. Olatunji Ruwase (20 papers)
  3. Ammar Ahmad Awan (15 papers)
  4. Samyam Rajbhandari (21 papers)
  5. Yuxiong He (59 papers)
  6. Abhinav Bhatele (33 papers)
Citations (15)