Insights into Efficient Transformer Training via Galvatron's Automatic Parallelism
The paper "Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism" addresses a pressing challenge in deep learning: efficiently training large Transformer models across multiple GPUs. The proposed system, Galvatron, combines multiple dimensions of parallelism and automatically selects among them to optimize training performance.
Overview of Galvatron's Approach
The need for efficient training mechanisms is underscored by the significant computational demands of large-scale Transformer models, which are now applied across NLP, CV, and beyond. Galvatron responds to this need by automatically determining the optimal hybrid parallelism strategy within a framework that incorporates multiple parallelism paradigms: data parallelism (DP), sharded data parallelism (SDP), tensor parallelism (TP), and pipeline parallelism (PP).
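To give a sense of why this search space grows quickly, the sketch below enumerates ways to split N GPUs across data-, tensor-, and pipeline-parallel degrees. This is an illustrative simplification, not Galvatron's actual search space: it ignores per-layer choices and treats SDP as a drop-in replacement on the data-parallel axis.

```python
# Illustrative enumeration of hybrid-parallel configurations: every way to
# factor num_gpus into (dp, tp, pp) degrees. Galvatron's real search space is
# far larger, since strategies can also vary per layer.

def hybrid_configs(num_gpus):
    """Yield (dp, tp, pp) degree triples whose product equals num_gpus."""
    for tp in range(1, num_gpus + 1):
        if num_gpus % tp:
            continue
        rest = num_gpus // tp
        for pp in range(1, rest + 1):
            if rest % pp:
                continue
            yield (rest // pp, tp, pp)

# Even for 8 GPUs there are 10 distinct degree assignments.
configs = list(hybrid_configs(8))
```

Each additional axis (e.g. choosing DP vs. SDP per group, or a different strategy per layer) multiplies this count, which is what motivates Galvatron's decision-tree pruning rather than brute-force enumeration.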
Galvatron stands out by using a decision-tree structure to prune the expansive search space of candidate parallelism strategies. The system then applies dynamic programming to select and execute a parallelization plan, optimizing throughput under the available memory budget.
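The dynamic-programming idea can be sketched as follows: for each layer, pick the strategy that minimizes total time while the accumulated memory stays within budget. The strategy names, time costs, and memory costs below are illustrative placeholders, not figures from the paper, and the recurrence is a deliberate simplification of Galvatron's actual cost model.

```python
# Hypothetical sketch of a Galvatron-style dynamic program: choose a
# parallelism strategy per layer to minimize total time under a memory budget.
from functools import lru_cache

# (strategy, per-layer time cost, per-layer memory cost) -- illustrative only.
STRATEGIES = [
    ("DP",  1.0, 4.0),   # fastest, most memory-hungry
    ("SDP", 1.3, 2.0),   # shards optimizer state to save memory
    ("TP",  1.5, 1.5),   # most memory-frugal, slowest here
]

def plan(num_layers, mem_budget):
    """Return (total_time, per-layer strategy list) minimizing time."""
    INF = float("inf")

    @lru_cache(maxsize=None)
    def best(layer, mem_left):
        if layer == num_layers:
            return (0.0, ())          # all layers placed
        result = (INF, ())
        for name, t, m in STRATEGIES:
            if m <= mem_left:         # strategy fits remaining budget
                sub_t, sub_plan = best(layer + 1, mem_left - m)
                if t + sub_t < result[0]:
                    result = (t + sub_t, (name,) + sub_plan)
        return result

    total, choice = best(0, mem_budget)
    return total, list(choice)
```

With a generous budget the planner picks the fast strategy everywhere; as the budget tightens, it trades speed for memory, which mirrors the throughput/memory trade-off the paper optimizes.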
Strong Numerical Results
In an extensive evaluation across Transformer workloads spanning BERT, T5, and ViT, Galvatron demonstrates superior throughput over a range of GPU memory budgets, consistently outperforming contemporary approaches. For instance, on the ViT-Huge-32 model under stringent memory constraints, Galvatron's throughput was significantly higher than that achieved by standalone or limited-dimension parallelism strategies.
The evaluation quantifies substantial efficiency gains, with improvements reaching up to 338% in certain configurations compared to existing methods. Such quantitative evidence positions Galvatron as a compelling solution for overcoming the bottlenecks of large-scale distributed model training.
Implications for Future Research and Development
The implications of Galvatron's approach lie in its potential to influence the design and training of emerging large-scale DL models. By providing a scalable solution that efficiently allocates compute resources through automated, hybrid parallelism, Galvatron could meaningfully decrease the computational cost of adopting larger Transformer models in practice.
Moreover, the innovative decision-tree framework and dynamic programming approach set a precedent for future research efforts seeking optimization in multi-GPU settings. Given the explosive growth of Transformer applications and architectural complexity, future endeavors may further extend Galvatron’s principles to heterogeneous systems or environments with complex communication topologies.
In summary, the paper's work on Galvatron not only enhances the contemporary understanding of multi-GPU parallelism but also opens promising avenues for reducing training bottlenecks as deep learning workloads continue to grow. As Transformer models evolve, so too will the need for sophisticated, automated frameworks that optimize training efficiency at scale.