Breadth-First Pipeline Parallelism (2211.05953v2)

Published 11 Nov 2022 in cs.DC, cs.AI, cs.CL, and cs.LG

Abstract: We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model trained with a small batch size per GPU compared to Megatron-LM, which would reduce training time and cost by the same amount on a large GPU cluster.
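
As the name suggests, the scheduling order is the central idea: with a looping pipeline placement, each GPU holds several non-contiguous stages, and a breadth-first schedule pushes every in-flight microbatch through one local stage before moving to the next, rather than depth-first through all local stages per microbatch. The sketch below is illustrative only, not the paper's implementation; the stage indices and microbatch count are invented, and it merely contrasts the two loop orderings.

```python
# Minimal sketch contrasting breadth-first vs. depth-first orderings of forward
# passes on one GPU that owns several non-contiguous pipeline stages (a
# "looping" placement). Illustration only: stage indices and microbatch counts
# are invented, and this is not the paper's implementation.

def breadth_first_order(local_stages, num_microbatches):
    """Push every microbatch through one local stage before the next stage."""
    return [(stage, mb)
            for stage in local_stages
            for mb in range(num_microbatches)]

def depth_first_order(local_stages, num_microbatches):
    """Push each microbatch through all local stages before the next microbatch."""
    return [(stage, mb)
            for mb in range(num_microbatches)
            for stage in local_stages]

if __name__ == "__main__":
    local_stages = [0, 4]   # e.g. a GPU holding stages 0 and 4 of an 8-stage pipeline
    microbatches = 2
    print("breadth-first:", breadth_first_order(local_stages, microbatches))
    # [(0, 0), (0, 1), (4, 0), (4, 1)]
    print("depth-first:  ", depth_first_order(local_stages, microbatches))
    # [(0, 0), (4, 0), (0, 1), (4, 1)]
```

In the depth-first ordering each microbatch traverses all local stages before the next one starts; the breadth-first ordering instead finishes a stage for the whole (small) per-GPU batch first, which is the property the abstract pairs with fully sharded data parallelism.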

References (24)
  1. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
  2. PaLM: Scaling language modeling with Pathways, 2022. URL https://arxiv.org/abs/2204.02311.
  3. FlashAttention: Fast and memory-efficient exact attention with IO-awareness, 2022. URL https://arxiv.org/abs/2205.14135.
  4. BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. URL https://arxiv.org/abs/1810.04805.
  5. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity, 2021. URL https://arxiv.org/abs/2101.03961.
  6. Accurate, large minibatch SGD: Training ImageNet in 1 hour. ArXiv, abs/1706.02677, 2017.
  7. PipeDream: Fast and efficient pipeline parallel DNN training, 2018. URL https://arxiv.org/abs/1806.03377.
  8. Training compute-optimal large language models, 2022. URL https://arxiv.org/abs/2203.15556.
  9. GPipe: Efficient training of giant neural networks using pipeline parallelism, 2018. URL https://arxiv.org/abs/1811.06965.
  10. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361.
  11. Reducing activation recomputation in large transformer models, 2022. URL https://arxiv.org/abs/2205.05198.
  12. Chimera. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, November 2021. doi: 10.1145/3458817.3476145. URL https://doi.org/10.1145/3458817.3476145.
  13. An empirical model of large-batch training, 2018. URL https://arxiv.org/abs/1812.06162.
  14. Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021. URL https://arxiv.org/abs/2104.04473.
  15. Language models are unsupervised multitask learners, 2019.
  16. ZeRO: Memory optimizations toward training trillion parameter models, 2019. URL https://arxiv.org/abs/1910.02054.
  17. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning, 2021. URL https://arxiv.org/abs/2104.07857.
  18. Measuring the effects of data parallelism on neural network training, 2018. URL https://arxiv.org/abs/1811.03600.
  19. Mesh-TensorFlow: Deep learning for supercomputers. Advances in Neural Information Processing Systems, 31, 2018.
  20. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019. URL https://arxiv.org/abs/1909.08053.
  21. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model, 2022. URL https://arxiv.org/abs/2201.11990.
  22. Don't decay the learning rate, increase the batch size. ArXiv, abs/1711.00489, 2018.
  23. Attention is all you need, 2017. URL https://arxiv.org/abs/1706.03762.
  24. OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.
Authors (1)
  1. Joel Lamy-Poirier
Citations (1)