
Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices (2401.01728v2)

Published 3 Jan 2024 in cs.LG, cs.AI, and cs.DC

Abstract: Modern deep learning models, growing larger and more complex, have demonstrated exceptional generalization and accuracy due to training on huge datasets. This trend is expected to continue. However, the increasing size of these models poses challenges in training, as traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular heterogeneous PCs with limited resources connected across the internet to achieve favourable performance metrics. Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without necessitating that each node hosts the entire model. These clusters engage in $\textit{Zero-Bubble Asynchronous Model Parallel}$ training, and a $\textit{Parallel Multi-Ring All-Reduce}$ method is employed to effectively execute global parameter averaging across all clusters. We have framed our asynchronous SGD loss function as a block structured optimization problem with delayed updates and derived an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.

Introduction to Ravnest

The rapid advance of deep learning has brought growing model complexity and increasing computational demand. Models such as LLMs and multi-modal architectures require powerful hardware to meet their extensive training requirements, and traditional centralized training methods, which expect each system to hold a full copy of the model, are increasingly unfit for the task. This paper presents "Ravnest," an approach that merges the benefits of data and model parallelism to train complex models in a decentralized, asynchronous fashion without placing excessive strain on any single machine.

Efficient Asynchronous Training

Ravnest harnesses numerous PCs with varied capabilities, organizing them into clusters whose members have similar data transfer rates and compute capacity. Within each cluster, the global model is partitioned so that every node trains only a segment of it, and the clusters carry out what the paper terms "Zero-Bubble Asynchronous Model Parallel" training. A "Parallel Multi-Ring All-Reduce" method is then used to average parameters across clusters. Because no single node ever has to host the entire model, Ravnest can train large models on modest systems, improving both efficiency and accessibility in deep learning research.
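
To make the cross-cluster averaging step concrete, the following is a minimal Python sketch of a single-ring all-reduce (a reduce-scatter pass followed by an all-gather pass) that averages parameter chunks held by several simulated nodes. It is an illustration under assumed names and data layout, not Ravnest's implementation; the paper's Parallel Multi-Ring All-Reduce runs several such rings concurrently over disjoint parameter shards.

```python
import numpy as np

def ring_allreduce_average(node_chunks):
    """Average parameter chunks held by n nodes using a single ring:
    a reduce-scatter pass followed by an all-gather pass.
    node_chunks[i][j] is the j-th chunk held by node i (n chunks per node)."""
    n = len(node_chunks)
    chunks = [[c.astype(float).copy() for c in node] for node in node_chunks]

    # Reduce-scatter: after n-1 synchronous steps, node i holds the fully
    # summed chunk with index (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
        for src, idx, payload in sends:
            chunks[(src + 1) % n][idx] += payload

    # All-gather: circulate the reduced chunks so every node ends up with all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
        for src, idx, payload in sends:
            chunks[(src + 1) % n][idx] = payload

    return [[c / n for c in node] for node in chunks]

# Tiny usage check: three simulated nodes, parameters split into three chunks each.
rng = np.random.default_rng(0)
data = [[rng.standard_normal(4) for _ in range(3)] for _ in range(3)]
result = ring_allreduce_average(data)
expected = [sum(data[i][j] for i in range(3)) / 3 for j in range(3)]
assert all(np.allclose(result[i][j], expected[j]) for i in range(3) for j in range(3))
```

The ring structure keeps each node's per-step communication volume constant regardless of how many nodes participate, which is the property that makes it attractive for bandwidth-limited peers.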

Theoretical Underpinnings and Contributions

Delving into the mechanics of Ravnest, the paper frames its asynchronous SGD objective as a block-structured optimization problem with delayed updates and derives a convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. This analysis shows that Ravnest maintains a robust update path despite the delay, or 'staleness', inherent in asynchronous updates. The key contributions include this convergence analysis, a proof of linear speedup with respect to the number of participating clusters, and a bound on the staleness parameter under which these guarantees hold.
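
As a toy illustration of why bounded staleness still permits convergence, the snippet below runs SGD on a simple quadratic while applying gradients computed from parameters that are several steps old. This is a didactic sketch with assumed hyperparameters, not the paper's algorithm or analysis.

```python
import numpy as np

# Toy SGD with bounded gradient staleness on f(x) = 0.5 * ||x||^2.
# Gradients are computed from parameters up to `tau_max` steps old, mimicking
# the delayed updates analysed in the paper. All values here are assumptions.
rng = np.random.default_rng(0)
dim, steps, tau_max, lr, noise = 10, 2000, 4, 0.05, 0.1

x = rng.standard_normal(dim)
history = [x.copy()]                                          # past iterates for stale reads

for k in range(steps):
    tau = rng.integers(0, min(tau_max, len(history) - 1) + 1)  # staleness in [0, tau_max]
    stale_x = history[-1 - tau]                                # parameters tau steps old
    grad = stale_x + noise * rng.standard_normal(dim)          # stochastic gradient of the quadratic
    x = x - lr * grad
    history.append(x.copy())

print("final loss:", round(0.5 * float(x @ x), 4))            # near zero despite the delays
```

With the step size kept small relative to the staleness bound, the iterates still contract toward the minimum, which is the intuition behind obtaining an $O\left(\frac{1}{\sqrt{K}}\right)$ rate under a bounded staleness parameter.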

Implementation Insights

Ravnest's implementation hinges on careful cluster formation. Each participant's available resources, chiefly RAM and network bandwidth, are assessed so that nodes can be grouped into clusters capable of efficient distributed training. A genetic algorithm is described for sorting nodes into clusters, and the clusters adapt as peers join or leave during training. Fault-tolerant behaviours are also incorporated to ensure continuity and reliability in an inherently unstable internet-based environment.
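
To illustrate the flavour of such a search, here is a small genetic-algorithm sketch that assigns nodes (described only by RAM and bandwidth) to a fixed number of clusters, balancing total RAM across clusters while keeping bandwidth within each cluster similar. The fitness function, operators, and parameters are assumptions chosen for illustration and do not reproduce the paper's formulation.

```python
import random

# Illustrative genetic-algorithm cluster formation (not the paper's exact method).
# Each node has RAM in GB and bandwidth in MB/s; an assignment maps node -> cluster id.
random.seed(0)
nodes = [{"ram": random.choice([4, 8, 16, 32]),
          "bandwidth": random.uniform(5, 100)} for _ in range(24)]
n_clusters, pop_size, generations = 4, 60, 200

def fitness(assignment):
    """Lower is better: RAM imbalance across clusters + bandwidth spread within them."""
    target_ram = sum(n["ram"] for n in nodes) / n_clusters
    score = 0.0
    for c in range(n_clusters):
        members = [nodes[i] for i, a in enumerate(assignment) if a == c]
        if not members:
            return float("inf")                        # empty clusters are invalid
        score += abs(sum(m["ram"] for m in members) - target_ram)        # RAM balance
        bws = [m["bandwidth"] for m in members]
        score += max(bws) - min(bws)                                     # bandwidth similarity
    return score

def mutate(assignment):
    child = list(assignment)
    child[random.randrange(len(child))] = random.randrange(n_clusters)   # reassign one node
    return child

population = [[random.randrange(n_clusters) for _ in nodes] for _ in range(pop_size)]
for _ in range(generations):
    population.sort(key=fitness)
    parents = population[: pop_size // 4]              # keep the fittest quarter
    children = []
    while len(children) < pop_size - len(parents):
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(nodes))          # one-point crossover, then mutation
        children.append(mutate(a[:cut] + b[cut:]))
    population = parents + children

best = min(population, key=fitness)
print("best fitness:", round(fitness(best), 2))
```

A static assignment like this sketch only covers the initial grouping; in Ravnest the clusters additionally have to absorb peers joining or leaving mid-training, which is handled by the adaptive and fault-tolerant mechanisms described above.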

In conclusion, Ravnest's methodology offers an innovative avenue for distributed machine learning, alleviating the hardware barriers that often impede progress in this domain. It represents a considerable stride towards democratizing the development and training of high-caliber deep learning models.

Authors (3)
  1. Anirudh Rajiv Menon
  2. Unnikrishnan Menon
  3. Kailash Ahirwar