Breadth-First Pipeline Parallelism (2211.05953v2)
Published 11 Nov 2022 in cs.DC, cs.AI, cs.CL, and cs.LG
Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
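The key idea summarized in the abstract is the order in which each GPU visits its pipeline stages ("chunks") versus the micro-batches. The toy sketch below (hypothetical helper names, not the paper's implementation) contrasts a depth-first ordering with a breadth-first one; intuitively, the breadth-first order lets the fully sharded data-parallel communication for a chunk be overlapped and amortized across all micro-batches instead of being repeated for every micro-batch.

```python
# Toy sketch only: NOT the paper's code; depth_first_order and
# breadth_first_order are hypothetical names. With a "looped" stage placement,
# each GPU holds several non-adjacent pipeline stages ("chunks"); the two
# functions show the order in which one GPU would issue forward passes over
# its chunks and micro-batches.

from typing import List, Tuple


def depth_first_order(num_chunks: int, num_microbatches: int) -> List[Tuple[int, int]]:
    """Depth-first: push each micro-batch through every local chunk before
    starting the next micro-batch (pairs are grouped by micro-batch)."""
    return [(chunk, mb)
            for mb in range(num_microbatches)
            for chunk in range(num_chunks)]


def breadth_first_order(num_chunks: int, num_microbatches: int) -> List[Tuple[int, int]]:
    """Breadth-first: push every micro-batch through one local chunk before
    moving on to the next chunk (pairs are grouped by chunk), so each chunk's
    sharded parameters need to be gathered once per visit rather than once
    per micro-batch."""
    return [(chunk, mb)
            for chunk in range(num_chunks)
            for mb in range(num_microbatches)]


if __name__ == "__main__":
    # Two chunks per GPU, four micro-batches.
    print("depth-first:  ", depth_first_order(2, 4))
    print("breadth-first:", breadth_first_order(2, 4))
```

Running the sketch makes the contrast concrete: the depth-first order alternates chunks for each micro-batch, while the breadth-first order finishes all four micro-batches on chunk 0 before touching chunk 1.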
- Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
- PaLM: Scaling language modeling with pathways, 2022. URL https://arxiv.org/abs/2204.02311.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022. URL https://arxiv.org/abs/2205.14135.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL https://arxiv.org/abs/1810.04805.
- Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021. URL https://arxiv.org/abs/2101.03961.
- Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2017. URL https://arxiv.org/abs/1706.02677.
- PipeDream: Fast and efficient pipeline parallel DNN training, 2018. URL https://arxiv.org/abs/1806.03377.
- Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
- GPipe: Efficient training of giant neural networks using pipeline parallelism, 2018. URL https://arxiv.org/abs/1811.06965.
- Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
- Reducing activation recomputation in large transformer models, 2022. URL https://arxiv.org/abs/2205.05198.
- Chimera: Efficiently training large-scale neural networks with bidirectional pipelines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21). ACM, November 2021. doi: 10.1145/3458817.3476145. URL https://doi.org/10.1145/3458817.3476145.
- An empirical model of large-batch training, 2018. URL https://arxiv.org/abs/1812.06162.
- Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021. URL https://arxiv.org/abs/2104.04473.
- Language models are unsupervised multitask learners, 2019.
- ZeRO: Memory optimizations toward training trillion parameter models, 2019. URL https://arxiv.org/abs/1910.02054.
- ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning, 2021. URL https://arxiv.org/abs/2104.07857.
- Measuring the effects of data parallelism on neural network training, 2018. URL https://arxiv.org/abs/1811.03600.
- Mesh-TensorFlow: Deep learning for supercomputers. Advances in Neural Information Processing Systems, 31, 2018.
- Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019. URL https://arxiv.org/abs/1909.08053.
- Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model, 2022. URL https://arxiv.org/abs/2201.11990.
- Don’t decay the learning rate, increase the batch size, 2018. URL https://arxiv.org/abs/1711.00489.
- Attention is all you need, 2017. URL https://arxiv.org/abs/1706.03762.
- OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.
Joel Lamy-Poirier