Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies (2403.03699v1)
Abstract: Neural networks have become a cornerstone of machine learning. As these models grow ever more complex, so does the hardware and software infrastructure needed to train and deploy them. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by expressing neural networks as operator graphs and examining the dimensions along which they can be parallelised: intra-operator and inter-operator parallelism. We answer the second question by collecting and listing both the implementation challenges of each type of parallelism and the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion-parameter transformer networks, to the extent that this is possible given the limited information shared about these networks.
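To make the two parallelism dimensions named in the abstract concrete, the sketch below is an illustrative toy example rather than code from the paper: it splits a single matrix multiplication column-wise across two hypothetical devices (intra-operator parallelism) and places consecutive layers of a small network on different devices (inter-operator parallelism). The "devices" are plain Python functions, and all names are invented for illustration.

```python
# Toy illustration of intra-operator vs. inter-operator parallelism.
# Conceptual sketch using NumPy only; no real accelerators are involved.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # batch of activations
w1 = rng.standard_normal((16, 32))        # weights of layer 1
w2 = rng.standard_normal((32, 4))         # weights of layer 2

# Intra-operator parallelism: one matmul is sharded column-wise, so each
# hypothetical device holds half of w1 and computes half of the output.
w1_dev0, w1_dev1 = np.split(w1, 2, axis=1)
y_dev0 = x @ w1_dev0                      # would run on device 0
y_dev1 = x @ w1_dev1                      # would run on device 1
y = np.concatenate([y_dev0, y_dev1], axis=1)  # stands in for an all-gather
assert np.allclose(y, x @ w1)

# Inter-operator parallelism: whole operators (layers) of the operator graph
# are placed on different devices and executed as pipeline stages.
def stage0(a):                            # device 0 owns layer 1
    return np.maximum(a @ w1, 0.0)

def stage1(a):                            # device 1 owns layer 2
    return a @ w2

out = stage1(stage0(x))                   # activations flow between stages
print(out.shape)                          # (8, 4)
```

In practice the two dimensions are combined, and deciding how to partition the operator graph across devices is exactly the optimisation problem the survey discusses under automatic parallelisation.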