On the Optimal Time Complexities in Decentralized Stochastic Asynchronous Optimization

Published 25 May 2024 in math.OC (arXiv:2405.16218v2)

Abstract: We consider the decentralized stochastic asynchronous optimization setup, where many workers asynchronously compute stochastic gradients and asynchronously communicate with each other using edges in a multigraph. For both homogeneous and heterogeneous setups, we prove new time complexity lower bounds under the assumption that computation and communication speeds are bounded. We develop a new nearly optimal method, Fragile SGD, and a new optimal method, Amelie SGD, that converge under arbitrary heterogeneous computation and communication speeds and match our lower bounds (up to a logarithmic factor in the homogeneous setting). Our time complexities are new, nearly optimal, and provably improve upon all previous asynchronous and synchronous stochastic methods in the decentralized setup.
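
To make the setup concrete, below is a minimal, hypothetical Python sketch of asynchronous decentralized SGD on a ring graph: each worker repeatedly takes a local stochastic gradient step on a toy quadratic objective and asynchronously averages ("gossips") its model with a random neighbor. The toy objective, ring topology, step size, and gossip-averaging rule are assumptions chosen purely for illustration; this is not the paper's Fragile SGD or Amelie SGD.

import threading

import numpy as np

dim, n_workers, n_steps, lr = 5, 4, 500, 0.05
target = np.random.default_rng(0).normal(size=dim)   # minimizer of the toy objective

# ring graph: worker i has edges to (i - 1) % n and (i + 1) % n
neighbors = {i: [(i - 1) % n_workers, (i + 1) % n_workers] for i in range(n_workers)}

models = [np.zeros(dim) for _ in range(n_workers)]    # one local model per worker
locks = [threading.Lock() for _ in range(n_workers)]  # protect each model

def worker(i):
    rng = np.random.default_rng(i)  # per-worker RNG (numpy generators are not thread-safe)
    for _ in range(n_steps):
        # local step: stochastic gradient of the toy objective f(x) = ||x - target||^2
        with locks[i]:
            g = 2.0 * (models[i] - target) + rng.normal(scale=0.1, size=dim)
            models[i] -= lr * g
        # communication step: average with one (possibly stale) neighbor over an edge
        j = int(rng.choice(neighbors[i]))
        lo, hi = min(i, j), max(i, j)
        with locks[lo], locks[hi]:                    # consistent lock ordering avoids deadlock
            avg = 0.5 * (models[i] + models[j])
            models[i], models[j] = avg.copy(), avg.copy()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print([round(float(np.linalg.norm(m - target)), 3) for m in models])

The per-worker locks and the fixed lock ordering in the gossip step are only there to keep the threaded simulation consistent; they stand in for the asynchronous message exchanges over graph edges that the paper analyzes.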
