
ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less (2301.08895v6)

Published 21 Jan 2023 in cs.DC

Abstract: Wall-clock convergence time and communication rounds are critical performance metrics in distributed learning under the parameter-server setting. Synchronous methods converge fast but are not robust to stragglers, while asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to stale gradients; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are twofold. First, the number of workers that the parameter server (PS) waits for per round before aggregating gradients is adaptively selected to balance straggling against staleness. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
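
The abstract only describes ABS at a high level, so the following is a minimal toy simulation of how one such PS round might look. The adaptive rule in select_wait_count, the staleness bound, the constants, and the toy gradients are all illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

NUM_WORKERS = 8
DIM = 10
STALENESS_BOUND = 3      # assumed cap on tolerated model-version lag
LEARNING_RATE = 0.1
NUM_ROUNDS = 20

rng = np.random.default_rng(0)


def select_wait_count(staleness):
    """Assumed adaptive rule: wait for more workers when average staleness is
    high (to limit stale gradients), fewer when it is low (to avoid stragglers)."""
    frac = min(1.0, 0.25 + 0.25 * float(staleness.mean()))
    return max(1, int(round(frac * NUM_WORKERS)))


def run_abs_simulation():
    model = rng.normal(size=DIM)
    # Model version each worker's in-flight gradient is based on.
    worker_versions = np.zeros(NUM_WORKERS, dtype=int)

    for t in range(NUM_ROUNDS):
        staleness = t - worker_versions
        k = select_wait_count(staleness)

        # Pretend the k fastest workers deliver their (noisy) gradients first.
        arrived = rng.permutation(NUM_WORKERS)[:k]
        grads = model + rng.normal(size=(k, DIM))   # toy gradient of ||x||^2 / 2
        model = model - LEARNING_RATE * grads.mean(axis=0)

        # Contributing workers restart from the freshly updated model.
        worker_versions[arrived] = t + 1

        # Workers whose in-flight computation is too stale are told to
        # discard it and restart from the new model as well.
        too_stale = (t + 1 - worker_versions) > STALENESS_BOUND
        worker_versions[too_stale] = t + 1

        print(f"round {t:2d}: waited for {k} workers, "
              f"max staleness {int((t + 1 - worker_versions).max())}")

    return model


if __name__ == "__main__":
    run_abs_simulation()
```

In this sketch, waiting for only the first k arrivals captures the straggler-robustness of asynchronous methods, while the forced restart of overly stale workers mimics the bounded-staleness guarantee; the real scheme's selection rule for k is the paper's contribution and is not reproduced here.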

