Breadth-First Pipeline Parallelism (2211.05953v2)
Published 11 Nov 2022 in cs.DC, cs.AI, cs.CL, and cs.LG
Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
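The key idea summarized in the abstract is the order in which each GPU visits its pipeline stages ("chunks") versus the micro-batches. The toy sketch below (hypothetical helper names, not the paper's implementation) contrasts a depth-first ordering with a breadth-first one; intuitively, the breadth-first order lets the fully sharded data-parallel communication for a chunk be overlapped and amortized across all micro-batches instead of being repeated for every micro-batch.

```python
# Toy sketch only: NOT the paper's code; depth_first_order and
# breadth_first_order are hypothetical names. With a "looped" stage placement,
# each GPU holds several non-adjacent pipeline stages ("chunks"); the two
# functions show the order in which one GPU would issue forward passes over
# its chunks and micro-batches.

from typing import List, Tuple


def depth_first_order(num_chunks: int, num_microbatches: int) -> List[Tuple[int, int]]:
    """Depth-first: push each micro-batch through every local chunk before
    starting the next micro-batch (pairs are grouped by micro-batch)."""
    return [(chunk, mb)
            for mb in range(num_microbatches)
            for chunk in range(num_chunks)]


def breadth_first_order(num_chunks: int, num_microbatches: int) -> List[Tuple[int, int]]:
    """Breadth-first: push every micro-batch through one local chunk before
    moving on to the next chunk (pairs are grouped by chunk), so each chunk's
    sharded parameters need to be gathered once per visit rather than once
    per micro-batch."""
    return [(chunk, mb)
            for chunk in range(num_chunks)
            for mb in range(num_microbatches)]


if __name__ == "__main__":
    # Two chunks per GPU, four micro-batches.
    print("depth-first:  ", depth_first_order(2, 4))
    print("breadth-first:", breadth_first_order(2, 4))
```

Running the sketch makes the contrast concrete: the depth-first order alternates chunks for each micro-batch, while the breadth-first order finishes all four micro-batches on chunk 0 before touching chunk 1.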
- Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
- PaLM: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022. URL https://arxiv.org/abs/2205.14135.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL https://arxiv.org/abs/1810.04805.
- Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021. URL https://arxiv.org/abs/2101.03961.
- Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017. URL https://arxiv.org/abs/1706.02677.
- PipeDream: Fast and efficient pipeline parallel DNN training, 2018. URL https://arxiv.org/abs/1806.03377.
- Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
- GPipe: Efficient training of giant neural networks using pipeline parallelism, 2018. URL https://arxiv.org/abs/1811.06965.
- Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
- Reducing activation recomputation in large transformer models, 2022. URL https://arxiv.org/abs/2205.05198.
- Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). ACM, November 2021. doi: 10.1145/3458817.3476145. URL https://doi.org/10.1145/3458817.3476145.
- An empirical model of large-batch training, 2018. URL https://arxiv.org/abs/1812.06162.
- Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021. URL https://arxiv.org/abs/2104.04473.
- Language models are unsupervised multitask learners, 2019.
- ZeRO: Memory optimizations toward training trillion parameter models, 2019. URL https://arxiv.org/abs/1910.02054.
- ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning, 2021. URL https://arxiv.org/abs/2104.07857.
- Measuring the effects of data parallelism on neural network training, 2018. URL https://arxiv.org/abs/1811.03600.
- Mesh-TensorFlow: Deep learning for supercomputers. Advances in Neural Information Processing Systems, 31, 2018.
- Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019. URL https://arxiv.org/abs/1909.08053.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model, 2022. URL https://arxiv.org/abs/2201.11990.
- Don’t decay the learning rate, increase the batch size, 2018. URL https://arxiv.org/abs/1711.00489.
- Attention is all you need, 2017. URL https://arxiv.org/abs/1706.03762.
- OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.
Joel Lamy-Poirier