Distributed Random Reshuffling Methods with Improved Convergence (2306.12037v3)

Published 21 Jun 2023 in math.OC, cs.LG, and cs.MA

Abstract: This paper proposes two distributed random reshuffling methods, namely Gradient Tracking with Random Reshuffling (GT-RR) and Exact Diffusion with Random Reshuffling (ED-RR), to solve the distributed optimization problem over a connected network, where a set of agents aim to minimize the average of their local cost functions. Both algorithms invoke random reshuffling (RR) update for each agent, inherit favorable characteristics of RR for minimizing smooth nonconvex objective functions, and improve the performance of previous distributed random reshuffling methods both theoretically and empirically. Specifically, both GT-RR and ED-RR achieve the convergence rate of $O(1/[(1-\lambda)^{1/3}m^{1/3}T^{2/3}])$ in driving the (minimum) expected squared norm of the gradient to zero, where $T$ denotes the number of epochs, $m$ is the sample size for each agent, and $1-\lambda$ represents the spectral gap of the mixing matrix. When the objective functions further satisfy the Polyak-Łojasiewicz (PL) condition, we show GT-RR and ED-RR both achieve the $O(1/[(1-\lambda)mT^2])$ convergence rate in terms of the averaged expected differences between the agents' function values and the global minimum value. Notably, both results are comparable to the convergence rates of centralized RR methods (up to constant factors depending on the network topology) and outperform those of previous distributed random reshuffling algorithms.
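To make the setting concrete, the sketch below shows a generic gradient-tracking loop in which each agent works through its local samples in a freshly reshuffled order every epoch, communicating through a doubly stochastic mixing matrix W. This is a minimal illustrative toy on least-squares data, not the paper's exact GT-RR or ED-RR recursion; all names and parameters (W, alpha, n_agents, m, T) are assumptions made for the example.

```python
# Illustrative sketch only (assumed setup, not the paper's exact GT-RR update):
# gradient tracking where each agent processes its local samples in a freshly
# reshuffled order every epoch.
import numpy as np

rng = np.random.default_rng(0)
n_agents, m, d = 4, 8, 5        # agents, samples per agent, dimension
alpha, T = 0.05, 50             # step size, number of epochs

# Doubly stochastic mixing matrix for a ring topology (assumed).
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = 0.25
    W[i, (i + 1) % n_agents] = 0.25

# Toy data: agent i holds m least-squares samples (A[i, j], b[i, j]).
A = rng.standard_normal((n_agents, m, d))
b = rng.standard_normal((n_agents, m))

def sample_grad(i, j, x):
    """Gradient of agent i's j-th sample loss 0.5 * (a^T x - b)^2 at x."""
    return A[i, j] * (A[i, j] @ x - b[i, j])

x = np.zeros((n_agents, d))                          # local iterates
g = np.array([sample_grad(i, 0, x[i]) for i in range(n_agents)])
y = g.copy()                                         # gradient trackers

for t in range(T):                                   # one epoch per t
    perms = [rng.permutation(m) for _ in range(n_agents)]
    for j in range(m):                               # reshuffled inner loop
        x_new = W @ x - alpha * y                    # mix, then descend
        g_new = np.array([sample_grad(i, perms[i][j], x_new[i])
                          for i in range(n_agents)])
        y = W @ y + g_new - g                        # track the average gradient
        x, g = x_new, g_new

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```

The spectral gap $1-\lambda$ of the mixing matrix W is the network quantity that enters both rates quoted in the abstract: better-connected topologies have a larger gap and a milder effect on the bounds.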
