Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies (2403.03699v1)

Published 6 Mar 2024 in cs.DC and cs.LG

Abstract: Neural networks have become a cornerstone of machine learning. As the trend for these to get more and more complex continues, so does the underlying hardware and software infrastructure for training and deployment. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by looking at how neural networks can be parallelised and expressing these as operator graphs while exploring the available dimensions. The dimensions along which neural networks can be parallelised are intra-operator and inter-operator. We answer the second question by collecting and listing both implementation challenges for the types of parallelism, as well as the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion parameter transformer networks, to the extent that this is possible with the limited information shared about these networks.

References (35)
  1. “Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning”, 2019 DOI: 10.48550/arXiv.1906.08879
  2. “PaLM 2 Technical Report”, 2023 DOI: 10.48550/arXiv.2305.10403
  3. “Pathways: Asynchronous Distributed Dataflow for ML” In Proceedings of Machine Learning and Systems, 2022
  4. “Language Models are Few-Shot Learners” In Advances in Neural Information Processing Systems, 2020
  5. “TensorOpt: Exploring the Tradeoffs in Distributed DNN Training With Auto-Parallelism”, 2022 DOI: 10.1109/TPDS.2021.3132413
  6. “Training Deep Nets with Sublinear Memory Cost”, 2016 DOI: 10.48550/arXiv.1604.06174
  7. “NVIDIA A100 Tensor Core GPU: Performance and Innovation”, 2021 DOI: 10.1109/MM.2021.3061394
  8. “PaLM: Scaling Language Modeling with Pathways”, 2023
  9. “Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism” In 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021
  10. “PipeDream: Fast and Efficient Pipeline Parallel DNN Training”, 2018 DOI: 10.48550/arXiv.1806.03377
  11. “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism” In Advances in Neural Information Processing Systems, 2019
  12. “Exploring Hidden Dimensions in Accelerating Convolutional Neural Networks” In Proceedings of the 35th International Conference on Machine Learning, 2018
  13. Zhihao Jia, Matei Zaharia and Alex Aiken “Beyond Data and Model Parallelism for Deep Neural Networks.” In Proceedings of Machine Learning and Systems, 2019
  14. “Reducing activation recomputation in large transformer models” In Proceedings of Machine Learning and Systems, 2023
  15. “HW-NAS-Bench: Hardware-Aware Neural Architecture Search Benchmark”, 2021 DOI: 10.48550/arXiv.2103.10584
  16. “A Survey on Auto-Parallelism of Large-Scale Deep Learning Training”, 2023 DOI: 10.1109/TPDS.2023.3281931
  17. “NAS-Bench-Suite: NAS Evaluation is (Now) Surprisingly Easy”, 2022 DOI: 10.48550/arXiv.2201.13396
  18. Orlando Moreira, Merten Popp and Christian Schulz “Evolutionary multi-level acyclic graph partitioning” In Proceedings of the Genetic and Evolutionary Computation Conference, 2018 DOI: 10.1145/3205455.3205464
  19. Orlando Moreira, Merten Popp and Christian Schulz “Graph Partitioning with Acyclicity Constraints”, 2017 DOI: 10.48550/arXiv.1704.00705
  20. “PipeDream: generalized pipeline parallelism for DNN training” In Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019 DOI: 10.1145/3341301.3359646
  21. “Efficient large-scale language model training on GPU clusters using megatron-LM” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021 DOI: 10.1145/3458817.3476209
  22. “GPT-4 Technical Report”, 2023 DOI: 10.48550/arXiv.2303.08774
  23. “Scaling Language Models: Methods, Analysis & Insights from Training Gopher”, 2022 DOI: 10.48550/arXiv.2112.11446
  24. “DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters” In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020 DOI: 10.1145/3394486.3406703
  25. Kirk Schloegel, George Karypis and Vipin Kumar “Parallel static and dynamic multi-constraint graph partitioning” In Concurrency and Computation: Practice and Experience, 2002 DOI: 10.1002/cpe.605
  26. “Compute Trends Across Three Eras of Machine Learning” In 2022 International Joint Conference on Neural Networks (IJCNN), 2022 DOI: 10.1109/IJCNN55064.2022.9891914
  27. “Mesh-TensorFlow: Deep Learning for Supercomputers” In Advances in Neural Information Processing Systems, 2018
  28. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism”, 2020 DOI: 10.48550/arXiv.1909.08053
  29. “Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model”, 2022 DOI: 10.48550/arXiv.2201.11990
  30. “Automatic Graph Partitioning for Very Large-scale Deep Learning” In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2021 DOI: 10.1109/IPDPS49936.2021.00109
  31. “Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization” In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022
  32. “Efficient and Systematic Partitioning of Large and Deep Neural Networks for Parallelization” In Euro-Par 2021: Parallel Processing, 2021 DOI: 10.1007/978-3-030-85665-6_13
  33. Minjie Wang, Chien-chin Huang and Jinyang Li “Supporting Very Large Models using Automatic Dataflow Graph Partitioning” In Proceedings of the Fourteenth EuroSys Conference 2019, 2019 DOI: 10.1145/3302424.3303953
  34. “GSPMD: General and Scalable Parallelization for ML Computation Graphs”, 2021 DOI: 10.48550/arXiv.2105.04663
  35. “Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning” In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022

Summary

  • The paper categorizes model parallelism into inter-operator and intra-operator strategies, providing a detailed evaluation of their computational trade-offs.
  • The paper addresses challenges such as pipeline stalls and high communication overhead, emphasizing the need for optimal partitioning and accurate network modeling.
  • The paper illustrates practical applications in large language models like Megatron-LM, Gopher, and PaLM, demonstrating hybrid strategies for scalable training.

The paper "Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies" (2403.03699) provides a comprehensive examination of model parallelism, a key technique to efficiently train large neural networks distributed across multiple processing units.

Key Aspects of the Paper:

  1. Types of Model Parallelism:
    • The paper categorizes model parallelism into two principal types: inter-operator and intra-operator parallelism.
      • Inter-Operator Parallelism: Involves partitioning the computation graph into segments, each assigned to a different device. Communication occurs at the boundaries between these segments.
      • Intra-Operator Parallelism: Involves parallelizing the computation inside a single operator (a node of the graph), such as a large matrix multiplication, by distributing its work across multiple devices.
    • Hybrid strategies often combine these types to optimize resource use and performance; a minimal sketch of intra-operator parallelism appears after this list.
  2. Challenges in Model Parallelism:
    • Efficiently utilizing resources is a significant challenge, in particular avoiding pipeline stalls (devices idling while they wait on upstream stages) in inter-operator parallelism.
    • High communication overhead is noted for intra-operator parallelism, specifically due to the data movement required for scattering and gathering operations across devices.
    • The complexity of optimally partitioning the operator graph, and of evaluating candidate partitionings, is another discussed challenge; a toy chain-partitioning example follows this list.
    • The paper highlights the ongoing difficulty of modeling communication times accurately, which requires accounting for network bandwidth and latency; a standard first-order cost model is reproduced after the list for reference.
  3. Modern Use-Cases in LLMs:
    • The paper uses transformer-based LLMs as its primary examples of model parallelism in practice.
      • Megatron-LM: Demonstrates the use of intra-layer parallelism to distribute the MLPs and self-attention blocks across GPUs and combines this with inter-layer parallelism.
      • Gopher and PaLM: Highlight the use of custom hardware like TPUs, utilizing both intra- and inter-layer parallelisms along with data parallelism to enhance training scalability.
      • GPT Models: The survey infers the probable use of model parallelism, although specifics remain undisclosed, underscoring how little architectural information is publicly available for these models.
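
To make the intra-operator case above concrete, here is a minimal NumPy sketch (not taken from the paper) of a Megatron-style column-parallel linear layer. "Devices" are simulated as column shards of the weight matrix, and the final concatenation stands in for the all-gather collective that a real distributed implementation would issue; all names and sizes are illustrative.

```python
# Minimal sketch of intra-operator (tensor) parallelism, simulating
# devices as array shards; illustrative only, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, n_devices = 8, 16, 32, 4

x = rng.standard_normal((batch, d_in))   # activations, replicated on all devices
w = rng.standard_normal((d_in, d_out))   # full weight of one linear operator

# Column-parallel split: each simulated device holds a slice of the weight
# columns and computes the matching slice of the output.
w_shards = np.split(w, n_devices, axis=1)
partial_outputs = [x @ w_k for w_k in w_shards]   # one matmul per "device"

# Concatenation plays the role of the all-gather that recombines shards.
y_parallel = np.concatenate(partial_outputs, axis=1)

# The sharded computation matches the single-device result.
assert np.allclose(y_parallel, x @ w)
print("column-parallel output shape:", y_parallel.shape)
```

The same pattern underlies the MLP and attention projections that Megatron-LM shards across GPUs, with real communication collectives inserted wherever shards must be recombined.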
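The partitioning challenge is easiest to see on the simplest possible operator graph, a linear chain: choose contiguous stage boundaries so that the most expensive stage, which bounds pipeline throughput, is as cheap as possible. The toy dynamic program below solves only this restricted case; the auto-parallelism systems surveyed in the paper must additionally handle general graphs, memory limits, and communication costs. Function names and cost values are hypothetical.

```python
# Toy partitioning of a *chain* of operators into contiguous pipeline
# stages, minimizing the cost of the most-loaded stage. Illustrative only.
from functools import lru_cache

def partition_chain(costs, stages):
    """Return (bottleneck_cost, stage_end_indices) for an optimal split."""
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Minimal bottleneck for operators i..n-1 split into k stages.
        if k == 1:
            return prefix[n] - prefix[i], [n]
        result = (float("inf"), [])
        for j in range(i + 1, n - k + 2):       # first stage = operators i..j-1
            head = prefix[j] - prefix[i]
            tail, cuts = best(j, k - 1)
            candidate = (max(head, tail), [j] + cuts)
            result = min(result, candidate, key=lambda t: t[0])
        return result

    return best(0, stages)

layer_costs = [4, 2, 7, 1, 3, 6, 2, 5]          # e.g. relative FLOPs per operator
bottleneck, cuts = partition_chain(layer_costs, stages=3)
print("bottleneck stage cost:", bottleneck, "| stages end at operators:", cuts)
```

Even this chain-only version needs memoization to stay tractable; for general operator graphs with balance and acyclicity constraints the problem is NP-hard, which is why the surveyed systems fall back on heuristics, restricted search spaces, or learned placement policies.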
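For the communication-modeling difficulty, a common first-order model (not necessarily the one used in any particular surveyed work) charges each message a fixed latency plus a bandwidth term; composing it over a ring all-reduce of an n-byte tensor across p devices gives the familiar estimate below, where alpha is the per-message latency and beta is the per-byte transfer time.

```latex
T_{\text{msg}}(n) \;=\; \alpha + n\,\beta,
\qquad
T_{\text{ring all-reduce}}(n, p) \;\approx\; 2\,(p-1)\,\alpha \;+\; 2\,\frac{p-1}{p}\, n\,\beta .
```

Such models are only approximations: real clusters add contention, topology effects, and overlap with computation, which is exactly why the survey flags accurate communication modeling as an open difficulty.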

Complementary Literature:

Recent research provides deeper insights and complements the survey:

  • Megatron-LM (Shoeybi et al., 2019): Focuses on splitting computations within transformer layers across GPUs, emphasizing communication optimizations and fusing operations to minimize latency.
  • Alpa (Zheng et al., 2022): Automates the combination of inter- and intra-operator parallelisms, using a hierarchical search space to derive efficient execution plans through tailored compilation passes.
  • Zero Bubble Pipeline Parallelism (Qi et al., 2023): Proposes techniques to eliminate pipeline stalls by splitting backward computations and optimizing the schedule so that devices stay consistently utilized; the pipeline-bubble arithmetic such methods target is sketched below.

These papers, alongside methods such as Ring Self-Attention for handling long sequence lengths (Li et al., 2021), collectively clarify how LLMs are trained efficiently across large distributed systems and highlight innovations that address the many challenges of model parallelism.
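
As a rough illustration of the pipeline-bubble problem that GPipe mitigates with micro-batching and that zero-bubble schedules try to eliminate, the sketch below simulates a simple GPipe-style schedule with unit-time forward and backward passes and counts idle slots per stage. Under these simplifying assumptions the idle fraction matches the closed form (p-1)/(m+p-1) for p stages and m micro-batches; the code is illustrative and not taken from any of the cited systems.

```python
# Toy GPipe-style schedule: p stages, m micro-batches, unit-time forward
# and backward passes, no interleaving or overlap tricks. Illustrative only.

def gpipe_bubble_fraction(p: int, m: int) -> float:
    """Simulate the schedule on a time grid and return the idle fraction."""
    busy = [set() for _ in range(p)]
    for s in range(p):
        for i in range(m):
            # Forward of micro-batch i on stage s occupies slot s + i;
            # backwards then run in reverse stage order after all forwards.
            busy[s].add(s + i)
            busy[s].add((p - 1) + m + (p - 1 - s) + i)
    total = 2 * (m + p - 1)                      # schedule length in slots
    idle = sum(total - len(b) for b in busy)
    return idle / (p * total)

for p, m in [(4, 4), (4, 16), (8, 32)]:
    sim = gpipe_bubble_fraction(p, m)
    closed_form = (p - 1) / (m + p - 1)
    print(f"p={p:2d} m={m:3d}  simulated={sim:.3f}  (p-1)/(m+p-1)={closed_form:.3f}")
```

Raising the number of micro-batches m shrinks the bubble, which is why schedulers trade activation memory for more in-flight micro-batches, and why zero-bubble approaches restructure the backward pass instead.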
