Automating Parallelism for Distributed Deep Learning: A Review of Alpa
Distributed deep learning requires sophisticated systems support to parallelize large-scale neural networks efficiently across many computing devices. The paper "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning" introduces Alpa, a compiler system that automates model-parallel training by generating execution plans that unify data, operator, and pipeline parallelism. This addresses a key limitation of existing model-parallel training systems, which either require users to hand-craft parallelization plans or automatically generate plans from only a restricted space of configurations.
Problem Statement and Approach
The central problem Alpa addresses is the complexity and expertise required to manually craft efficient parallel execution plans for large deep learning models. To overcome this, Alpa organizes execution plans into a hierarchical space by distinguishing inter-operator parallelism from intra-operator parallelism. This decomposition makes it tractable to automatically generate parallel execution plans that adapt to complex model architectures and distributed clusters.
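From the user's perspective, this automation surfaces as a single parallelizing transform applied to an ordinary JAX training step. The sketch below follows the decorator-style interface of the open-source Alpa project; the MLP model, data shapes, and hyperparameters are illustrative placeholders, and exact decorator options and defaults may differ across Alpa versions.

```python
# Minimal usage sketch (assumes alpa, jax, flax, and optax are installed).
import alpa
import jax
import jax.numpy as jnp
import optax
from flax import linen as nn
from flax.training import train_state


class MLP(nn.Module):
    """Placeholder model; any JAX/Flax model could stand in here."""

    @nn.compact
    def __call__(self, x):
        x = nn.relu(nn.Dense(1024)(x))
        return nn.Dense(1024)(x)


@alpa.parallelize  # Alpa traces the step and compiles a distributed execution plan.
def train_step(state, batch):
    def loss_fn(params):
        out = state.apply_fn(params, batch["x"])
        return jnp.mean((out - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(state.params)
    return state.apply_gradients(grads=grads)


model = MLP()
params = model.init(jax.random.PRNGKey(0), jnp.ones((64, 1024)))
state = train_state.TrainState.create(
    apply_fn=model.apply, params=params, tx=optax.adam(1e-3))
batch = {"x": jnp.ones((64, 1024)), "y": jnp.ones((64, 1024))}

state = train_step(state, batch)  # Runs under the automatically generated plan.
```

The training loop itself is unchanged; the decorator is where the automatically generated plan replaces what would otherwise be a hand-written parallelization strategy.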
Alpa uses a two-level compilation strategy (a toy sketch of the combined search follows the list):
- Intra-Operator Parallelism: An integer linear programming (ILP) formulation chooses how each operator in a stage is sharded across a device mesh, trading off computation and communication costs.
- Inter-Operator Parallelism: A dynamic programming pass slices the computational graph into pipeline stages and assigns each stage to a submesh, minimizing end-to-end pipeline latency, including the communication overheads inherent in pipeline parallelism.
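The interplay between the two levels can be made concrete with a toy search. The sketch below is not Alpa's implementation: the operator costs, submesh shapes, and the `intra_op_cost` stub (standing in for the ILP solve) are invented for illustration, and the dynamic program minimizes only the bottleneck stage rather than Alpa's full pipeline-latency objective.

```python
from functools import lru_cache

# Toy operator graph: per-operator compute cost, in arbitrary units (invented).
OP_COSTS = [4, 3, 6, 2, 5, 7]
# Candidate submeshes as (nodes, gpus_per_node) shapes (also invented).
SUBMESHES = [(1, 2), (1, 4), (2, 4)]


def intra_op_cost(start, end, mesh):
    """Stand-in for the intra-operator ILP: cost of ops [start, end) on `mesh`.

    The real pass solves an ILP over per-operator sharding choices; here we
    just divide compute across devices and add a flat communication penalty.
    """
    devices = mesh[0] * mesh[1]
    compute = sum(OP_COSTS[start:end]) / devices
    communication = 0.1 * devices  # assumed resharding/all-reduce overhead
    return compute + communication


@lru_cache(maxsize=None)
def best_plan(start, meshes_left):
    """DP over pipeline stages: minimize the slowest stage (the pipeline bottleneck)."""
    if start == len(OP_COSTS):
        return 0.0, ()
    if meshes_left == 0:
        return float("inf"), ()
    mesh = SUBMESHES[len(SUBMESHES) - meshes_left]  # assign submeshes in order (a simplification)
    best_cost, best_stages = float("inf"), ()
    for end in range(start + 1, len(OP_COSTS) + 1):
        stage_cost = intra_op_cost(start, end, mesh)
        rest_cost, rest_stages = best_plan(end, meshes_left - 1)
        total = max(stage_cost, rest_cost)  # throughput is limited by the slowest stage
        if total < best_cost:
            best_cost, best_stages = total, ((start, end, mesh),) + rest_stages
    return best_cost, best_stages


if __name__ == "__main__":
    cost, stages = best_plan(0, len(SUBMESHES))
    print(f"bottleneck stage cost: {cost:.2f}")
    for start, end, mesh in stages:
        print(f"  ops[{start}:{end}] -> submesh {mesh}")
```

In the real system, each stage-submesh pair is costed by invoking the intra-operator ILP pass, and the dynamic program accounts for the full pipeline schedule and the choice of submesh shapes rather than the simplified bottleneck objective used here.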
Results and Evaluation
The evaluation shows that Alpa generates execution plans that match or outperform hand-tuned model-parallel systems, even on the models those systems were designed for. Notably, Alpa achieves substantial speedups when training GShard-style Mixture-of-Experts models, reporting improvements of up to 9.7× over DeepSpeed when scaling to multiple nodes.
Equally significant is Alpa's ability to generalize to models with heterogeneous architectures, such as Wide-ResNet, without any manually designed strategy. The system also maintains good scaling, reaching up to 80% efficiency on distributed resources.
Implications and Future Directions
Alpa's automatic parallelization holds substantial promise for accelerating ML research by removing the need for model developers to have deep expertise in systems optimization. The empirical results indicate that Alpa not only bridges the gap between theoretical optimization and practical deployment but also advances the scalability and efficiency achievable in distributed deep learning.
Looking forward, future work could extend Alpa to handle more dynamic computational graphs and explore dynamic scheduling techniques that may further reduce execution latency. Extending Alpa's applicability to emerging AI architectures and heterogeneous or hybrid computing environments also remains a compelling direction.
In conclusion, Alpa represents a substantial step toward reducing the complexity of scaling deep learning, offering a versatile, automated approach to model-parallel training that adapts to diverse model and cluster configurations.