CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence (2301.05872v3)

Published 14 Jan 2023 in math.OC, cs.DC, cs.LG, and cs.MA

Abstract: In this paper, we consider solving the distributed optimization problem over a multi-agent network under a communication-restricted setting. We study a compressed decentralized stochastic gradient method, termed "compressed exact diffusion with adaptive stepsizes (CEDAS)", and show that the method asymptotically achieves a convergence rate comparable to that of centralized stochastic gradient descent (SGD) for both smooth strongly convex and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys the shortest transient time so far (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(nC^3/(1-\lambda_2)^2)$ for smooth strongly convex objective functions and $\mathcal{O}(n^3C^6/(1-\lambda_2)^4)$ for smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix and $C>0$ is the compression-related parameter. Notably, CEDAS exhibits the shortest transient times when $C < \mathcal{O}(1/(1-\lambda_2)^2)$, which is common in practice. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.
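
The abstract's guarantees are stated for unbiased compression operators with a variance parameter $C$. As a minimal sketch (not the paper's own code), the snippet below implements the standard rand-$k$ sparsifier, a common unbiased compressor: keeping $k$ random coordinates and rescaling by $d/k$ gives $\mathbb{E}[\mathcal{C}(x)] = x$ with relative variance at most $d/k - 1$, which plays the role of the compression-related parameter here. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def rand_k_compress(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased rand-k sparsification: keep k random coordinates, scale by d/k.

    E[C(x)] = x and E||C(x) - x||^2 <= (d/k - 1) * ||x||^2, so d/k - 1 acts as
    the compression-related variance parameter.
    """
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)  # k coordinates chosen uniformly at random
    out[idx] = (d / k) * x[idx]                 # rescale so the compressor is unbiased
    return out

# Monte Carlo check of unbiasedness: the averaged compressed vector approaches x.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
est = np.mean([rand_k_compress(x, k=10, rng=rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(est - x)))  # should be close to zero
```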

Authors (2)
  1. Kun Huang (85 papers)
  2. Shi Pu (109 papers)
Citations (9)
