Ravnest: Decentralized Asynchronous Training on Heterogeneous Devices (2401.01728v2)
Abstract: Modern deep learning models, growing ever larger and more complex, have demonstrated exceptional generalization and accuracy thanks to training on huge datasets, and this trend is expected to continue. However, the increasing size of these models poses challenges for training, as traditional centralized methods are limited by memory constraints at such scales. This paper proposes an asynchronous decentralized training paradigm for large modern deep learning models that harnesses the compute power of regular, resource-constrained, heterogeneous PCs connected across the internet to achieve favourable performance metrics. Ravnest facilitates decentralized training by efficiently organizing compute nodes into clusters with similar data transfer rates and compute capabilities, without requiring each node to host the entire model. These clusters engage in *Zero-Bubble Asynchronous Model Parallel* training, and a *Parallel Multi-Ring All-Reduce* method is employed to execute global parameter averaging across all clusters. We frame our asynchronous SGD loss function as a block-structured optimization problem with delayed updates and derive an optimal convergence rate of $O\left(\frac{1}{\sqrt{K}}\right)$. We further discuss linear speedup with respect to the number of participating clusters and the bound on the staleness parameter.
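The global parameter averaging step can be pictured with a minimal, single-process sketch of ring all-reduce, shown below. This is an illustrative simulation only, under assumed names (`ring_allreduce_average`, a single ring over in-memory arrays); it is not Ravnest's actual Parallel Multi-Ring All-Reduce implementation. Each simulated node splits its parameter vector into as many chunks as there are nodes, a reduce-scatter pass leaves each node with the global sum of one chunk, and an all-gather pass circulates those reduced chunks so every node ends up with the full average.

```python
# Minimal single-process sketch of ring all-reduce parameter averaging.
# Illustrative only: one ring over in-memory arrays, not Ravnest's
# Parallel Multi-Ring All-Reduce; all names here are assumptions.
import numpy as np

def ring_allreduce_average(node_params):
    """Average equal-length parameter vectors held by n simulated nodes."""
    n = len(node_params)
    # Each node splits its parameter vector into n chunks.
    chunks = [np.array_split(p.astype(np.float64), n) for p in node_params]

    # Reduce-scatter: after n - 1 steps, node i holds the global sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step - 1) % n  # chunk node i accumulates this step
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % n][c]

    # All-gather: circulate the fully reduced chunks around the ring so
    # every node ends up holding the complete summed vector.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n  # chunk node i overwrites this step
            chunks[i][c] = chunks[(i - 1) % n][c]

    # Divide by the number of nodes to turn the sums into averages.
    return [np.concatenate(node) / n for node in chunks]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = [rng.standard_normal(10) for _ in range(4)]  # 4 simulated clusters
    averaged = ring_allreduce_average(params)
    assert np.allclose(averaged[0], np.mean(params, axis=0))
```

Each of the two passes moves only a 1/n fraction of the parameters per step, which is why ring-style all-reduce is bandwidth-efficient for large messages. The sketch covers a single ring only; the paper's method runs multiple rings in parallel across clusters.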