Model Parallelism on Distributed Infrastructure: A Literature Review from Theory to LLM Case-Studies (2403.03699v1)
Abstract: Neural networks have become a cornerstone of machine learning. As these models grow ever more complex, so does the hardware and software infrastructure needed to train and deploy them. In this survey we answer three research questions: "What types of model parallelism exist?", "What are the challenges of model parallelism?", and "What is a modern use-case of model parallelism?" We answer the first question by expressing neural networks as operator graphs and examining the dimensions along which they can be parallelised: intra-operator and inter-operator parallelism. We answer the second question by collecting and listing both the implementation challenges of each type of parallelism and the problem of optimally partitioning the operator graph. We answer the last question by collecting and listing how parallelism is applied in modern multi-billion-parameter transformer networks, to the extent that this is possible given the limited information shared about these networks.
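To make the two parallelism dimensions named in the abstract concrete, the sketch below is an illustrative toy example rather than code from the paper: it splits a single matrix multiplication column-wise across two hypothetical devices (intra-operator parallelism) and places consecutive layers of a small network on different devices (inter-operator parallelism). The "devices" are plain Python functions, and all names are invented for illustration.

```python
# Toy illustration of intra-operator vs. inter-operator parallelism.
# Conceptual sketch using NumPy only; no real accelerators are involved.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # batch of activations
w1 = rng.standard_normal((16, 32))        # weights of layer 1
w2 = rng.standard_normal((32, 4))         # weights of layer 2

# Intra-operator parallelism: one matmul is sharded column-wise, so each
# hypothetical device holds half of w1 and computes half of the output.
w1_dev0, w1_dev1 = np.split(w1, 2, axis=1)
y_dev0 = x @ w1_dev0                      # would run on device 0
y_dev1 = x @ w1_dev1                      # would run on device 1
y = np.concatenate([y_dev0, y_dev1], axis=1)  # stands in for an all-gather
assert np.allclose(y, x @ w1)

# Inter-operator parallelism: whole operators (layers) of the operator graph
# are placed on different devices and executed as pipeline stages.
def stage0(a):                            # device 0 owns layer 1
    return np.maximum(a @ w1, 0.0)

def stage1(a):                            # device 1 owns layer 2
    return a @ w2

out = stage1(stage0(x))                   # activations flow between stages
print(out.shape)                          # (8, 4)
```

In practice the two dimensions are combined, and deciding how to partition the operator graph across devices is exactly the optimisation problem the survey discusses under automatic parallelisation.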