Adjacent Leader Decentralized Stochastic Gradient Descent (2405.11389v2)
Abstract: This work focuses on the decentralized deep learning optimization framework. We propose Adjacent Leader Decentralized Stochastic Gradient Descent (AL-DSGD) for improving final model performance, accelerating convergence, and reducing the communication overhead of decentralized deep learning optimizers. AL-DSGD relies on two main ideas. First, to increase the influence of the strongest learners on the learning system, it assigns weights to the neighboring workers according to both their performance and their degree when averaging among them, and it applies a corrective force on each worker dictated by both its currently best-performing neighbor and its neighbor with the maximal degree. Second, to alleviate the deterioration of convergence speed and performance at nodes with lower degrees, AL-DSGD relies on dynamic communication graphs, which effectively allow the workers to communicate with more nodes while keeping the degrees of the nodes low. Experiments demonstrate that AL-DSGD accelerates the convergence of state-of-the-art decentralized techniques and improves their test performance, especially in communication-constrained environments. We also theoretically prove the convergence of the proposed scheme. Finally, we release to the community a highly general and concise PyTorch-based library for distributed training of deep learning models that supports easy implementation of any distributed deep learning approach ((a)synchronous, (de)centralized).
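The abstract describes the two mechanisms (performance- and degree-weighted neighbor averaging with corrective forces toward the best-performing and maximal-degree neighbors, on top of dynamic communication graphs) without stating the update rule. Below is a minimal PyTorch sketch of how such a per-worker update could look; the weighting scheme, the `pull_best`/`pull_maxdeg` coefficients, and all function and argument names are illustrative assumptions, not the paper's actual algorithm or the released library's API.

```python
# Illustrative sketch of an AL-DSGD-style parameter update for one worker.
# The exact update rule and hyperparameters are assumptions for illustration.
import torch

def al_dsgd_step(local_params, neighbor_params, neighbor_losses, neighbor_degrees,
                 pull_best=0.1, pull_maxdeg=0.1):
    """Average with neighbors (weighted by performance and degree) and apply
    corrective pulls toward the best-performing and highest-degree neighbors.

    local_params     : list[torch.Tensor]        -- this worker's parameters
    neighbor_params  : list[list[torch.Tensor]]  -- parameter copies from neighbors
    neighbor_losses  : list[float]               -- recent validation loss per neighbor
    neighbor_degrees : list[int]                 -- degree of each neighbor in the graph
    """
    losses = torch.tensor(neighbor_losses, dtype=torch.float32)
    degrees = torch.tensor(neighbor_degrees, dtype=torch.float32)

    # Hypothetical mixing weights: lower-loss and higher-degree neighbors get
    # more weight; renormalize so the combined weights sum to one.
    weights = torch.softmax(-losses, dim=0) + degrees / degrees.sum()
    weights = weights / weights.sum()

    best = int(torch.argmin(losses))     # currently best-performing neighbor
    maxdeg = int(torch.argmax(degrees))  # neighbor with the maximal degree

    new_params = []
    for k, p in enumerate(local_params):
        # Weighted average over the neighbors' copies of parameter tensor k.
        avg = sum(w * nbr[k] for w, nbr in zip(weights, neighbor_params))
        # Corrective forces pulling toward the best and max-degree neighbors.
        avg = avg + pull_best * (neighbor_params[best][k] - p) \
                  + pull_maxdeg * (neighbor_params[maxdeg][k] - p)
        new_params.append(avg)
    return new_params
```

In a full training loop this step would follow each worker's local SGD update, with the neighbor set redrawn from the dynamic communication graph at every communication round.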