TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (2102.07988v2)

Published 16 Feb 2021 in cs.LG, cs.CL, and cs.DC

Abstract: Model parallelism has become a necessity for training modern large-scale deep LLMs. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based LLMs thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based LLMs. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods. The code for reproduction can be found at https://github.com/zhuohan123/terapipe

Authors (7)
  1. Zhuohan Li (29 papers)
  2. Siyuan Zhuang (9 papers)
  3. Shiyuan Guo (1 paper)
  4. Danyang Zhuo (33 papers)
  5. Hao Zhang (948 papers)
  6. Dawn Song (229 papers)
  7. Ion Stoica (177 papers)
Citations (97)

Summary

Token-Level Pipeline Parallelism for Large-Scale LLMs: Insights on TeraPipe

The ever-increasing scale of LLMs necessitates innovative parallelism strategies to address computational challenges. The paper "TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale LLMs" introduces a significant advancement in this domain by proposing a novel approach, TeraPipe, that harnesses the autoregressive property of Transformer models to enable token-level pipeline parallelism.

Overview of TeraPipe

Traditional model-parallel training approaches, such as partitioning weight matrices across devices (as in Megatron-LM) or microbatch-based pipelining (exemplified by GPipe), face bottlenecks from inter-device communication and from the idle time introduced by pipeline bubbles. TeraPipe circumvents these limitations with token-level pipeline parallelism within a single training sequence, leveraging the autoregressive nature of Transformers: because each token depends only on the tokens before it, a sequence can be split into slices that flow through the pipeline stages in a staggered fashion, so different parts of the same sequence are processed concurrently on different devices. This yields a much finer-grained pipeline than microbatch-level schemes.
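To make the staggered execution concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation) that simulates the timeline of a token-sliced pipeline. The names `pipeline_makespan`, `slice_sizes`, `num_stages`, and `stage_time` are assumptions for this sketch; the point is that causal attention lets a slice enter a stage as soon as the slices before it have cleared that stage.

```python
# Minimal sketch (illustrative, not the TeraPipe codebase): simulate the
# makespan of streaming one sequence, cut into token slices, through a
# chain of pipeline stages.

def pipeline_makespan(slice_sizes, num_stages, stage_time):
    """Finish time of the last slice at the last stage.

    A slice enters stage s only after (a) the previous slice has left
    stage s and (b) the slice itself has finished stage s-1 -- the
    standard pipeline recurrence. Causal attention is what makes
    per-slice execution valid: slice i only needs the outputs of
    slices 0..i at each stage.
    """
    finish = [[0.0] * len(slice_sizes) for _ in range(num_stages)]
    for s in range(num_stages):
        for i, n in enumerate(slice_sizes):
            ready = max(
                finish[s][i - 1] if i > 0 else 0.0,   # stage busy with earlier slice
                finish[s - 1][i] if s > 0 else 0.0,   # slice still in earlier stage
            )
            finish[s][i] = ready + stage_time(n)
    return finish[-1][-1]

# Toy comparison: one 2048-token slice vs. eight 256-token slices on a
# 4-stage pipeline with a linear per-slice cost model.
cost = lambda n: n * 1e-3
print(pipeline_makespan([2048], 4, cost))      # no intra-sequence pipelining
print(pipeline_makespan([256] * 8, 4, cost))   # finer slices overlap the stages
```

Finer slices shrink the pipeline bubble, but each slice must remain large enough to keep the devices busy; choosing that balance is exactly what the paper's dynamic program optimizes, as described next.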

Methodology

TeraPipe uses dynamic programming to determine the token partitioning scheme that maximizes training throughput for a given model and hardware configuration, minimizing the device idle time that otherwise limits pipeline parallelism. The key technical contribution is a partitioning algorithm that accounts for the uneven computational load across token positions: under causal attention, a later slice attends to all preceding tokens and is therefore more expensive than an earlier slice of the same length, which makes uniform slicing suboptimal.
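As a rough illustration of the optimization problem, the sketch below (a simplification, not the paper's exact formulation) searches for slice boundaries that minimize an idealized makespan of the form "sum of slice costs plus (number of stages - 1) times the largest slice cost". The functions `best_partition` and `slice_cost` and the toy cost model are hypothetical stand-ins; in practice the per-slice costs would come from measured timings.

```python
# Illustrative dynamic-programming sketch (not the authors' code) for
# choosing token-slice boundaries under a hypothetical cost model.

from functools import lru_cache

def slice_cost(start, length):
    # Toy model: a slice's work grows with its length and with how many
    # earlier tokens it attends to under causal attention.
    return length * (start + length)

def best_partition(seq_len, num_stages):
    """Minimize an idealized makespan:
    sum(slice costs) + (num_stages - 1) * max(slice cost).
    Sweep a cap on the largest slice cost; for each cap, a prefix DP
    minimizes the summed cost with every slice kept under the cap.
    """
    caps = sorted({slice_cost(s, l)
                   for s in range(seq_len)
                   for l in range(1, seq_len - s + 1)})
    best_makespan, best_cuts = float("inf"), None
    for cap in caps:
        @lru_cache(maxsize=None)
        def dp(pos):
            # (minimal summed cost, slice lengths) covering tokens pos..seq_len-1
            if pos == seq_len:
                return 0.0, ()
            best_here = (float("inf"), None)
            for length in range(1, seq_len - pos + 1):
                c = slice_cost(pos, length)
                if c > cap:
                    break  # under this toy model, longer slices only cost more
                rest_cost, rest_cuts = dp(pos + length)
                if rest_cuts is None:
                    continue
                if c + rest_cost < best_here[0]:
                    best_here = (c + rest_cost, (length,) + rest_cuts)
            return best_here

        total, cuts = dp(0)
        if cuts is not None:
            makespan = total + (num_stages - 1) * cap
            if makespan < best_makespan:
                best_makespan, best_cuts = makespan, cuts

    return best_makespan, best_cuts

# Example: a 16-token sequence on a 4-stage pipeline. The optimum is
# generally a non-uniform slicing, since later tokens attend to more context.
print(best_partition(seq_len=16, num_stages=4))
```

The exhaustive sweep over caps is for clarity only; the intent is simply to show why the optimal slicing is non-uniform and why it depends on both the cost profile of the model and the depth of the pipeline.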

Key Results

The implementation of TeraPipe demonstrated a 5.0x speedup in training the largest variant of GPT-3 (175 billion parameters) on a cluster of 48 AWS p3.16xlarge instances. The paper provides detailed analysis across multiple GPT-3 configurations, showing that the performance gain from TeraPipe grows as the model size increases. Because TeraPipe is orthogonal to existing model-parallel methods, it can be combined with them for further efficiency gains.

Implications and Future Directions

By providing a substantial boost in training performance, TeraPipe not only contributes a new dimension to model parallelism but also sets a precedent for future research on the efficient training of large LMs. The authors' work underscores the potential of token-level parallelism to mitigate the limitations introduced by large model and sequence sizes. This could prompt further exploration of even finer-grained parallelism, or of hybrid strategies that combine multiple parallelism dimensions, for other architectures, including bidirectional models such as BERT.

Conclusion

TeraPipe represents a strategic advancement in parallel training methods for large-scale LMs, addressing contemporary scalability and efficiency challenges. It exemplifies how a deep understanding of model architectures, combined with sophisticated algorithmic design, can significantly enhance computational performance. This paper's contributions are not only of immediate practical relevance for training large models but also lay the groundwork for future explorations in efficient ML system design.
