
Efficient Parallelization Layouts for Large-Scale Distributed Model Training (2311.05610v3)

Published 9 Nov 2023 in cs.LG and cs.DC

Abstract: Efficiently training LLMs requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for LLMs. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a Llama 13B model.
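
As a back-of-the-envelope illustration of the quantities the abstract mentions (this sketch is not from the paper): the pipeline bubble of a GPipe-style schedule is commonly estimated as (p - 1) / (m + p - 1) for p pipeline stages and m micro-batches per step, and Model FLOPs Utilization (MFU) is commonly estimated with the roughly 6 x parameters FLOPs-per-token rule for dense Transformers. The throughput and hardware figures in the example below are hypothetical.

```python
def pipeline_bubble_fraction(pipeline_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a GPipe-style pipeline: (p - 1) / (m + p - 1).

    For a fixed global batch size, a larger micro-batch size means fewer
    micro-batches m per step, so the bubble fraction grows -- consistent
    with the abstract's observation that larger micro-batch sizes lead to
    larger pipeline bubbles.
    """
    p, m = pipeline_stages, num_microbatches
    return (p - 1) / (m + p - 1)


def model_flops_utilization(params: float, tokens_per_second: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU: achieved model FLOPs divided by aggregate hardware peak FLOPs.

    Uses the common ~6 * params FLOPs-per-token estimate for one forward
    plus backward pass of a dense Transformer (attention FLOPs ignored).
    """
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Fewer micro-batches inflate the bubble (hypothetical p and m).
    print(pipeline_bubble_fraction(pipeline_stages=4, num_microbatches=32))  # ~0.086
    print(pipeline_bubble_fraction(pipeline_stages=4, num_microbatches=8))   # ~0.273
    # A 13B-parameter model on 64 A100s (312 TFLOP/s peak bf16 per GPU);
    # the 180k tokens/s throughput is a made-up number for illustration.
    print(model_flops_utilization(13e9, tokens_per_second=180_000,
                                  num_gpus=64, peak_flops_per_gpu=312e12))  # ~0.70
```

The same arithmetic motivates the abstract's headline recommendation: at a fixed global batch size, a micro-batch size of 1 maximizes the number of micro-batches m, which minimizes the pipeline bubble fraction.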

References (28)
  1. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  2. Training deep nets with sublinear memory cost, 2016.
  3. PaLM: Scaling language modeling with pathways, 2022.
  4. Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning, 2023.
  5. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  7. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023.
  8. Reducing activation recomputation in large transformer models, 2022.
  9. LightSeq: Sequence level parallelism for distributed training of long context transformers, 2023.
  10. Colossal-AI: A unified deep learning system for large-scale parallel training, 2022.
  11. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  12. Memory-efficient pipeline-parallel DNN training. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 7937–7947. PMLR, 2021.
  13. Efficient large-scale language model training on GPU clusters using Megatron-LM, 2021.
  14. OpenAI. GPT-4 technical report, 2023.
  15. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
  16. ZeRO: Memory optimizations toward training trillion parameter models. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2020.
  17. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash, editors, KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 3505–3506. ACM, 2020.
  18. Noam Shazeer. GLU variants improve transformer, 2020.
  19. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2019.
  20. RoFormer: Enhanced transformer with rotary position embedding, 2022.
  21. LLaMA: Open and efficient foundation language models, 2023.
  22. Llama 2: Open foundation and fine-tuned chat models, 2023.
  23. Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, 1990.
  24. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008, 2017.
  25. GSPMD: General and scalable parallelization for ML computation graphs, 2021.
  26. Root mean square layer normalization. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 12360–12371, 2019.
  27. PyTorch FSDP: Experiences on scaling fully sharded data parallel, 2023.
  28. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning, 2022.