ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less (2301.08895v6)
Abstract: Wall-clock convergence time and communication rounds are critical performance metrics in distributed learning with a parameter-server (PS) setting. Synchronous methods converge fast but are not robust to stragglers, while asynchronous methods reduce the wall-clock time per round but suffer from a degraded convergence rate due to the staleness of gradients; it is therefore natural to combine the two to strike a balance. In this work, we develop a novel asynchronous strategy, named adaptive bounded staleness (ABS), that leverages the advantages of both synchronous and asynchronous methods. The key enablers of ABS are two-fold. First, the number of workers that the PS waits for per round before gradient aggregation is adaptively selected to strike a straggling-staleness balance. Second, workers with relatively high staleness are required to start a new round of computation to alleviate the negative effect of staleness. Simulation results demonstrate the superiority of ABS over state-of-the-art schemes in terms of wall-clock time and communication rounds.
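To make the two mechanisms concrete, below is a minimal simulation sketch of an ABS-style PS round, assuming a simple exponential model of worker compute times. The adaptation rule in `choose_wait_count` and the `MAX_STALENESS` bound are illustrative placeholders, not the paper's exact rules; all names here are hypothetical.

```python
# Sketch: per round, the PS waits for an adaptively chosen number of workers,
# aggregates their gradients, and forces workers whose staleness would exceed
# a bound to discard stale work and restart from the latest model.
import random

NUM_WORKERS = 8
ROUNDS = 5
MAX_STALENESS = 2                     # hypothetical staleness bound
current_version = 0                   # global model version held by the PS
worker_version = [0] * NUM_WORKERS    # model version each worker computes on

def choose_wait_count(round_idx):
    """Placeholder for the adaptive rule deciding how many workers to wait for."""
    # Illustrative only: wait for fewer workers in later rounds to dodge stragglers.
    return max(2, NUM_WORKERS - round_idx)

for t in range(ROUNDS):
    k_t = choose_wait_count(t)

    # Simulate per-worker compute times and take the k_t fastest finishers.
    finish_times = {w: random.expovariate(1.0) for w in range(NUM_WORKERS)}
    fastest = sorted(finish_times, key=finish_times.get)[:k_t]

    # Staleness of each contributed gradient is the version gap to the PS model.
    staleness = {w: current_version - worker_version[w] for w in fastest}
    current_version += 1  # PS applies the aggregated update (aggregation omitted)

    # Contributing workers pull the new model; the rest keep computing unless
    # their staleness would exceed the bound, in which case they restart.
    for w in range(NUM_WORKERS):
        if w in fastest:
            worker_version[w] = current_version
        elif current_version - worker_version[w] > MAX_STALENESS:
            worker_version[w] = current_version  # discard stale work and restart
    print(f"round {t}: waited for {k_t} workers, staleness {staleness}")
```

The point of the sketch is the interplay of the two knobs: waiting for fewer workers shortens each round but raises staleness, while the restart rule caps how stale any aggregated gradient can become.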