
DIGEST: Fast and Communication Efficient Decentralized Learning with Local Updates (2307.07652v2)

Published 14 Jul 2023 in cs.LG and cs.DC

Abstract: Two widely considered decentralized learning algorithms are Gossip and random walk-based learning. Gossip algorithms (both synchronous and asynchronous versions) suffer from high communication cost, while random walk-based learning experiences increased convergence time. In this paper, we design a fast and communication-efficient asynchronous decentralized learning mechanism, DIGEST, by taking advantage of both Gossip and random-walk ideas, focusing on stochastic gradient descent (SGD). DIGEST is an asynchronous decentralized algorithm building on local-SGD algorithms, which were originally designed for communication-efficient centralized learning. We design both single-stream and multi-stream DIGEST, where the communication overhead may increase as the number of streams increases, and there is a convergence versus communication-overhead trade-off that can be leveraged. We analyze the convergence of single- and multi-stream DIGEST and prove that both algorithms approach the optimal solution asymptotically for both iid and non-iid data distributions. We evaluate the performance of single- and multi-stream DIGEST for logistic regression and a deep neural network, ResNet20. The simulation results confirm that multi-stream DIGEST has good convergence properties; i.e., its convergence time is better than or comparable to the baselines in the iid setting, and it outperforms the baselines in the non-iid setting.
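
To make the single-stream idea in the abstract concrete, below is a minimal, hypothetical Python sketch: every node runs local SGD on its own data while a single "stream" (a token walking the network) periodically visits a node, averages its model with the node's local model, and leaves the mixed model behind. All names (local_sgd, stream_model), the round-robin visiting order, the averaging rule, and the synthetic data are illustrative assumptions for this sketch; they are not the paper's exact algorithm or released code.

# Hypothetical, simplified sketch of a single-stream local-update scheme:
# nodes run local SGD in parallel while one "stream" visits nodes to mix models.
# The visiting schedule, mixing rule, and data split are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, local_samples = 5, 10, 200
w_true = rng.normal(size=dim)

# Non-iid-style split: each node's inputs are drawn around a different mean.
data = []
for i in range(n_nodes):
    X = rng.normal(loc=i - n_nodes / 2, size=(local_samples, dim))
    y = X @ w_true + 0.1 * rng.normal(size=local_samples)
    data.append((X, y))

models = [np.zeros(dim) for _ in range(n_nodes)]
stream_model = np.zeros(dim)          # model carried by the visiting stream
lr, local_steps, rounds = 1e-3, 5, 200

def local_sgd(w, X, y, steps, lr):
    """Plain SGD on the node's local least-squares loss."""
    for _ in range(steps):
        j = rng.integers(len(y))
        grad = (X[j] @ w - y[j]) * X[j]
        w = w - lr * grad
    return w

for r in range(rounds):
    # All nodes perform local updates (asynchrony is not modeled in this sketch).
    for i in range(n_nodes):
        X, y = data[i]
        models[i] = local_sgd(models[i], X, y, local_steps, lr)
    # The stream visits one node per round (round-robin for simplicity),
    # averages with its model, and leaves the mixed model behind.
    i = r % n_nodes
    mixed = 0.5 * (stream_model + models[i])
    stream_model, models[i] = mixed, mixed.copy()

err = np.mean([np.linalg.norm(w - w_true) for w in models])
print(f"average distance to w_true after {rounds} rounds: {err:.3f}")

Running this toy simulation shows the local models drifting toward the common solution even though each node only ever talks to the stream, which is the communication pattern the paper contrasts with all-neighbor Gossip exchanges.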

