Decentralized Training of Foundation Models in Heterogeneous Environments (2206.01288v4)

Published 2 Jun 2022 in cs.DC and cs.LG

Abstract: Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast, homogeneous interconnects and using carefully designed software systems that support both data parallelism and model/pipeline parallelism. Such dedicated clusters can be costly and difficult to obtain. Can we instead leverage the much greater amount of decentralized, heterogeneous, and lower-bandwidth interconnected compute? Previous works examining the heterogeneous, decentralized setting focus on relatively small models that can be trained in a purely data parallel manner. State-of-the-art schemes for model parallel foundation model training, such as Megatron, only consider the homogeneous data center setting. In this paper, we present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network. Our key technical contribution is a scheduling algorithm that allocates different computational "tasklets" in the training of foundation models to a group of decentralized GPU devices connected by a slow heterogeneous network. We provide a formal cost model and further propose an efficient evolutionary algorithm to find the optimal allocation strategy. We conduct extensive experiments that represent different scenarios for learning over geo-distributed devices simulated using real-world network measurements. In the most extreme case, across 8 different cities spanning 3 continents, our approach is 4.8X faster than prior state-of-the-art training systems (Megatron).

Decentralized Training of Foundation Models in Heterogeneous Environments

The paper under review presents a novel approach to training large-scale foundation models (FMs), such as GPT-3 and PaLM, in decentralized and heterogeneous computing environments. These models typically require significant computational resources, traditionally sourced from clusters within homogeneous data centers featuring fast interconnects. The research explores whether these computational demands can instead be met by decentralized, heterogeneous compute resources, which have become increasingly abundant and are often underutilized.

Key Contributions

The primary contribution of the paper is a new scheduling algorithm for training foundation models in decentralized settings. The algorithm allocates computational tasks, called "tasklets," across a group of decentralized GPU devices connected by slow, heterogeneous networks. It is built on a formal cost model that accounts for both data parallelism and pipeline parallelism during allocation, a significant departure from previous approaches, which focus mainly on data parallelism for smaller models.
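
The paper's full cost model is more detailed than can be reproduced here, but the following minimal sketch illustrates the kind of quantity such a model estimates: given a candidate assignment of tasklets to devices and a matrix of pairwise link bandwidths, it charges all-reduce traffic within each data-parallel group and point-to-point activation/gradient traffic between adjacent pipeline stages. The function name, arguments, and ring-all-reduce approximation are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def communication_cost(assignment, bandwidth, dp_bytes, pp_bytes):
    """Rough per-iteration communication cost of a tasklet-to-device assignment.

    assignment : (stages x replicas) array; assignment[s, r] is the device id
                 running pipeline stage s for data-parallel replica r
    bandwidth  : bandwidth[i, j] = link bandwidth between devices i and j (bytes/s)
    dp_bytes   : gradient bytes all-reduced within each data-parallel group
    pp_bytes   : activation/gradient bytes exchanged between adjacent stages
    """
    n_stages, n_replicas = assignment.shape
    cost = 0.0

    # Data parallelism: replicas of each stage all-reduce their gradients.
    # Approximate ring all-reduce time by the slowest link in the ring.
    if n_replicas > 1:
        for s in range(n_stages):
            group = assignment[s]
            slowest = min(bandwidth[group[r], group[(r + 1) % n_replicas]]
                          for r in range(n_replicas))
            cost += 2 * dp_bytes / slowest

    # Pipeline parallelism: adjacent stages exchange activations (forward)
    # and gradients (backward) over point-to-point links.
    for s in range(n_stages - 1):
        for r in range(n_replicas):
            cost += 2 * pp_bytes / bandwidth[assignment[s, r], assignment[s + 1, r]]

    return cost
```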

To search for near-optimal tasklet allocations under this cost model, the paper proposes an evolutionary algorithm that minimizes communication and computational overhead. The allocation strategy was evaluated in geo-distributed scenarios simulated from real-world network measurements between decentralized devices.
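
As a sketch of how such an evolutionary search might be structured (this is not the paper's actual algorithm, and it reuses the hypothetical `communication_cost` helper defined above), candidate assignments can be evolved by swapping device positions and retaining the lowest-cost individuals:

```python
import random
import numpy as np

def evolve_allocation(n_stages, n_replicas, bandwidth, dp_bytes, pp_bytes,
                      pop_size=64, generations=200, seed=0):
    """Toy evolutionary search over tasklet-to-device assignments."""
    rng = random.Random(seed)
    n_devices = n_stages * n_replicas  # one device per tasklet, for simplicity

    def cost(a):
        return communication_cost(a, bandwidth, dp_bytes, pp_bytes)

    def random_candidate():
        perm = list(range(n_devices))
        rng.shuffle(perm)
        return np.array(perm).reshape(n_stages, n_replicas)

    def mutate(candidate):
        child = candidate.copy().reshape(-1)
        i, j = rng.sample(range(n_devices), 2)  # swap two device positions
        child[i], child[j] = child[j], child[i]
        return child.reshape(n_stages, n_replicas)

    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[:pop_size // 2]          # keep the cheapest half
        offspring = [mutate(rng.choice(survivors))
                     for _ in range(pop_size - len(survivors))]
        population = survivors + offspring

    return min(population, key=cost)
```

In an actual deployment, `bandwidth` would be populated from measured pairwise link speeds between participating devices, and the returned grid would determine which device hosts each pipeline stage and data-parallel replica.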

Experimental Results

The experiments demonstrate the efficiency of the proposed method, particularly under extreme conditions. When deployed across devices in eight cities spanning three continents, the approach trained 4.8 times faster than the prior state-of-the-art system, Megatron. This result underscores the scheduling algorithm's ability to mitigate the limitations imposed by slow, heterogeneous communication networks. Furthermore, the implementation was only 1.7 to 3.5 times slower than training in a homogeneous data center, despite the network being up to 100 times slower, indicating promising scalability and adaptability in more constrained environments.

Implications

The practical implications of this research are significant: training large-scale models need not be confined to highly specialized data centers. By leveraging decentralized computational resources, the cost of training these models can be reduced substantially, democratizing access to foundation model development. This is especially relevant for smaller institutions and researchers with limited computing resources, and it could accelerate innovation in machine learning by lowering existing economic barriers.

From a theoretical standpoint, the research advances our understanding of model and pipeline parallelism in decentralized settings and provides a framework for future work in distributed machine learning. The proposed cost model and scheduler open new avenues for research into optimizing communication and computation across disparate, heterogeneous device networks.

Future Directions

The paper acknowledges several limitations and directions for future research. Dynamic scheduling that adapts to changing network conditions and device availability remains an open challenge. The system also assumes stable connections and consistently available devices, which may not hold in volunteer-computing contexts. Developing robust, fault-tolerant mechanisms to handle these real-world uncertainties will be essential in future implementations.

In conclusion, this paper presents an innovative approach to decentralized training of foundation models in heterogeneous environments. The results suggest that such methodologies can bridge the gap between centralized and decentralized training, paving the way for more inclusive and economically accessible AI development. This work lays a critical foundation for subsequent investigations into optimizing distributed resources for large-scale ML training tasks.

Authors (9)
  1. Binhang Yuan (45 papers)
  2. Yongjun He (12 papers)
  3. Jared Quincy Davis (10 papers)
  4. Tianyi Zhang (262 papers)
  5. Tri Dao (47 papers)
  6. Beidi Chen (61 papers)
  7. Percy Liang (239 papers)
  8. Ce Zhang (215 papers)
  9. Christopher Re (23 papers)
Citations (75)