Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey (2404.06114v1)

Published 9 Apr 2024 in cs.DC and cs.AI

Abstract: With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, models, and resources. Due to intensive synchronization of models and sharing of data across GPUs and computing nodes during distributed training and inference processes, communication efficiency becomes the bottleneck for achieving high performance at a large scale. This article surveys the literature over the period of 2018-2023 on algorithms and technologies aimed at achieving efficient communication in large-scale distributed deep learning at various levels, including algorithms, frameworks, and infrastructures. Specifically, we first introduce efficient algorithms for model synchronization and communication data compression in the context of large-scale distributed training. Next, we introduce efficient strategies related to resource allocation and task scheduling for use in distributed training and inference. After that, we present the latest technologies pertaining to modern communication infrastructures used in distributed deep learning with a focus on examining the impact of the communication overhead in a large-scale and heterogeneous setting. Finally, we conduct a case study on the distributed training of LLMs at a large scale to illustrate how to apply these technologies in real cases. This article aims to offer researchers a comprehensive understanding of the current landscape of large-scale distributed deep learning and to reveal promising future research directions toward communication-efficient solutions in this scope.
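To make the "communication data compression" theme concrete, below is a minimal, self-contained sketch (not code from the survey) of top-k gradient sparsification with error feedback, one representative family of compression techniques for distributed training. The worker count, tensor shapes, function names, and the in-process averaging that stands in for an all-reduce are all illustrative assumptions.

```python
# Illustrative sketch only: top-k gradient sparsification with error feedback,
# simulated for two workers in a single process. Not the survey's code.
import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Return indices and values of the k largest-magnitude gradient entries."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # k largest |g_i|, unordered
    return idx, flat[idx]                          # only this pair would be communicated

def densify(idx, values, shape):
    """Rebuild a dense gradient from the transmitted (indices, values) pair."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)

rng = np.random.default_rng(0)
shape, k = (4, 8), 4                               # transmit ~12.5% of the entries
workers = [rng.standard_normal(shape) for _ in range(2)]
residuals = [np.zeros(shape) for _ in workers]     # error-feedback memory per worker

for step in range(3):
    sent = []
    for w, grad in enumerate(workers):
        compensated = grad + residuals[w]          # add back previously dropped mass
        idx, vals = topk_sparsify(compensated, k)
        sent.append(densify(idx, vals, shape))
        residuals[w] = compensated - sent[-1]      # remember what was not transmitted
    averaged = sum(sent) / len(sent)               # stand-in for an all-reduce step
    print(f"step {step}: averaged-gradient norm {np.linalg.norm(averaged):.3f}, "
          f"residual norms {[round(float(np.linalg.norm(r)), 3) for r in residuals]}")
```

The error-feedback buffer is the key design choice in this style of compression: the gradient mass dropped in one round is carried into the next, which is what lets heavily sparsified updates still converge in practice.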

References (337)
  1. A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and N. Andrew, “Deep learning with cots hpc systems,” in International conference on machine learning.   PMLR, 2013, pp. 1337–1345.
  2. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  4. D. Bahdanau, K. H. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
  5. J. D. M.-W. C. Kenton and L. K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
  6. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2020.
  7. A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020.
  8. T. Hollon, C. Jiang, A. Chowdury, M. Nasir-Moin, A. Kondepudi, A. Aabedi, A. Adapa, W. Al-Holou, J. Heth, O. Sagher et al., “Artificial-intelligence-based molecular classification of diffuse gliomas using rapid, label-free optical imaging,” Nature Medicine, vol. 29, no. 4, pp. 828–832, 2023.
  9. A. H. Thieme, Y. Zheng, G. Machiraju, C. Sadee, M. Mittermaier, M. Gertler, J. L. Salinas, K. Srinivasan, P. Gyawali, F. Carrillo-Perez et al., “A deep-learning algorithm to classify skin lesions from mpox virus infection,” Nature medicine, vol. 29, no. 3, pp. 738–747, 2023.
  10. C. Yan, J. Qin, Q. Liu, Q. Ma, and Y. Kang, “Mapless navigation with safety-enhanced imitation learning,” IEEE Transactions on Industrial Electronics, vol. 70, no. 7, pp. 7073–7081, 2022.
  11. K. Lee, D. Isele, E. A. Theodorou, and S. Bae, “Spatiotemporal costmap inference for mpc via deep inverse reinforcement learning,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3194–3201, 2022.
  12. M. Aspri, G. Tsagkatakis, and P. Tsakalides, “Distributed training and inference of deep learning models for multi-modal land cover classification,” Remote Sensing, vol. 12, no. 17, p. 2670, 2020.
  13. J. M. Haut, M. E. Paoletti, S. Moreno-Álvarez, J. Plaza, J.-A. Rico-Gallego, and A. Plaza, “Distributed deep learning for remote sensing data interpretation,” Proceedings of the IEEE, vol. 109, no. 8, pp. 1320–1349, 2021.
  14. R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
  15. T.-C. Chiu, Y.-Y. Shih, A.-C. Pang, C.-S. Wang, W. Weng, and C.-T. Chou, “Semisupervised distributed learning with non-iid data for aiot service platform,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9266–9277, 2020.
  16. C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224–2287, 2019.
  17. N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
  18. E. Rodríguez, B. Otero, N. Gutiérrez, and R. Canal, “A survey of deep learning techniques for cybersecurity in mobile networks,” IEEE Communications Surveys & Tutorials, vol. 23, no. 3, pp. 1920–1955, 2021.
  19. M. A. Al-Garadi, A. Mohamed, A. K. Al-Ali, X. Du, I. Ali, and M. Guizani, “A survey of machine and deep learning methods for internet of things (iot) security,” IEEE Communications Surveys & Tutorials, vol. 22, no. 3, pp. 1646–1685, 2020.
  20. E. Baccour, N. Mhaisen, A. A. Abdellatif, A. Erbad, A. Mohamed, M. Hamdi, and M. Guizani, “Pervasive ai for iot applications: A survey on resource-efficient distributed artificial intelligence,” IEEE Communications Surveys & Tutorials, vol. 24, no. 4, pp. 2366–2418, 2022.
  21. Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 10, pp. 4229–4238, 2020.
  22. Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, “Energy efficient federated learning over wireless communication networks,” IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp. 1935–1949, 2020.
  23. S. Wang, D. Li, J. Geng, Y. Gu, and Y. Cheng, “Impact of network topology on the performance of dml: Theoretical analysis and practical factors,” in IEEE INFOCOM 2019-IEEE conference on computer communications.   IEEE, 2019, pp. 1729–1737.
  24. J. Xue, Y. Miao, C. Chen, M. Wu, L. Zhang, and L. Zhou, “Fast distributed deep learning over rdma,” in Proceedings of the Fourteenth EuroSys Conference 2019, 2019, pp. 1–14.
  25. J. Verbraeken, M. Wolting, J. Katzy, J. Kloppenburg, T. Verbelen, and J. S. Rellermeyer, “A survey on distributed machine learning,” Acm computing surveys (csur), vol. 53, no. 2, pp. 1–33, 2020.
  26. T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” ACM Computing Surveys (CSUR), vol. 52, no. 4, pp. 1–43, 2019.
  27. R. Mayer and H.-A. Jacobsen, “Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools,” ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–37, 2020.
  28. E. P. Xing, Q. Ho, P. Xie, and D. Wei, “Strategies and principles of distributed machine learning on big data,” Engineering, vol. 2, no. 2, pp. 179–195, 2016.
  29. S. Ouyang, D. Dong, Y. Xu, and L. Xiao, “Communication optimization strategies for distributed deep neural network training: A survey,” Journal of Parallel and Distributed Computing, vol. 149, pp. 52–65, 2021.
  30. E. Yu, D. Dong, and X. Liao, “Communication optimization algorithms for distributed deep learning systems: A survey,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 12, pp. 3294–3308, 2023.
  31. X. Cao, T. Başar, S. Diggavi, Y. C. Eldar, K. B. Letaief, H. V. Poor, and J. Zhang, “Communication-efficient distributed learning: An overview,” IEEE journal on selected areas in communications, 2023.
  32. Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li, “Communication-efficient distributed deep learning: A comprehensive survey,” arXiv preprint arXiv:2003.06307, 2023.
  33. J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, 2019.
  34. Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  35. Y. Shi, K. Yang, T. Jiang, J. Zhang, and K. B. Letaief, “Communication-efficient edge ai: Algorithms and systems,” IEEE Communications Surveys & Tutorials, vol. 22, no. 4, pp. 2167–2191, 2020.
  36. S. Deng, H. Zhao, W. Fang, J. Yin, S. Dustdar, and A. Y. Zomaya, “Edge intelligence: The confluence of edge computing and artificial intelligence,” IEEE Internet of Things Journal, vol. 7, no. 8, pp. 7457–7469, 2020.
  37. M. Chen, N. Shlezinger, H. V. Poor, Y. C. Eldar, and S. Cui, “Communication-efficient federated learning,” Proceedings of the National Academy of Sciences, vol. 118, no. 17, p. e2024789118, 2021.
  38. M. S. Murshed, C. Murphy, D. Hou, N. Khan, G. Ananthanarayanan, and F. Hussain, “Machine learning at the network edge: A survey,” ACM Computing Surveys (CSUR), vol. 54, no. 8, pp. 1–37, 2021.
  39. J. Liu, J. Huang, Y. Zhou, X. Li, S. Ji, H. Xiong, and D. Dou, “From distributed machine learning to federated learning: A survey,” Knowledge and Information Systems, vol. 64, no. 4, pp. 885–917, 2022.
  40. S. Duan, D. Wang, J. Ren, F. Lyu, Y. Zhang, H. Wu, and X. Shen, “Distributed artificial intelligence empowered by end-edge-cloud computing: A survey,” IEEE Communications Surveys & Tutorials, 2022.
  41. J. Yao, S. Zhang, Y. Yao, F. Wang, J. Ma, J. Zhang, Y. Chu, L. Ji, K. Jia, T. Shen et al., “Edge-cloud polarization and collaboration: A comprehensive survey for ai,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 7, pp. 6866–6886, 2022.
  42. Y. E. Sagduyu, S. Ulukus, and A. Yener, “Task-oriented communications for nextg: End-to-end deep learning and ai security aspects,” IEEE Wireless Communications, vol. 30, no. 3, pp. 52–60, 2023.
  43. P. Saikia, S. Biswas, K. Singh, and C.-P. Li, “Signal detection in gsm-based in-band full-duplex communication using dnn,” IEEE Transactions on Vehicular Technology, vol. 72, no. 2, pp. 2661–2666, 2022.
  44. P. Ferrand, A. Decurninge, and M. Guillaud, “Dnn-based localization from channel estimates: Feature design and experimental results,” in GLOBECOM 2020-2020 IEEE Global Communications Conference.   IEEE, 2020, pp. 1–6.
  45. I. Amerini, C.-T. Li, and R. Caldelli, “Social network identification through image classification with cnn,” IEEE access, vol. 7, pp. 35 264–35 273, 2019.
  46. C. Zhang, W. Jiang, Y. Zhang, W. Wang, Q. Zhao, and C. Wang, “Transformer and cnn hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–20, 2022.
  47. R. Bi, J. Xiong, Y. Tian, Q. Li, and K.-K. R. Choo, “Achieving lightweight and privacy-preserving object detection for connected autonomous vehicles,” IEEE Internet of Things Journal, vol. 10, no. 3, pp. 2314–2329, 2022.
  48. Y. Hua, Z. Zhao, R. Li, X. Chen, Z. Liu, and H. Zhang, “Deep learning with long short-term memory for time series prediction,” IEEE Communications Magazine, vol. 57, no. 6, pp. 114–119, 2019.
  49. Q. Feng, D. He, Z. Liu, H. Wang, and K.-K. R. Choo, “Securenlp: A system for multi-party privacy-preserving natural language processing,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 3709–3721, 2020.
  50. B. Say, “A unified framework for planning with learned neural network transition models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 6, 2021, pp. 5016–5024.
  51. M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini-batch training for stochastic optimization,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 661–670.
  52. I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3.   Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1139–1147.
  53. J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of machine learning research, vol. 12, no. 7, 2011.
  54. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  55. M. Lotfollahi, M. Jafari Siavoshani, R. Shirali Hossein Zade, and M. Saberian, “Deep packet: A novel approach for encrypted traffic classification using deep learning,” Soft Computing, vol. 24, no. 3, pp. 1999–2012, 2020.
  56. L. Vu, Q. U. Nguyen, D. N. Nguyen, D. T. Hoang, E. Dutkiewicz et al., “Learning latent representation for iot anomaly detection,” IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 3769–3782, 2020.
  57. G. Li, M. Müller, B. Ghanem, and V. Koltun, “Training graph neural networks with 1000 layers,” in International conference on machine learning.   PMLR, 2021, pp. 6437–6449.
  58. B. Hanin, “Which neural net architectures give rise to exploding and vanishing gradients?” Advances in neural information processing systems, vol. 31, 2018.
  59. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  60. S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, “Pytorch distributed: Experiences on accelerating data parallel training,” Proc. VLDB Endow., vol. 13, no. 12, p. 3005–3018, aug 2020.
  61. M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter server,” in Advances in Neural Information Processing Systems, vol. 27, 2014, pp. 19–27.
  62. A. N. Gomez, O. Key, K. Perlin, S. Gou, N. Frosst, J. Dean, and Y. Gal, “Interlocking backpropagation: Improving depthwise model-parallelism,” J. Mach. Learn. Res., vol. 23, no. 1, jan 2022.
  63. S. Li and T. Hoefler, “Chimera: efficiently training large-scale neural networks with bidirectional pipelines,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–14.
  64. W. Liu, Z. Lai, S. Li, Y. Duan, K. Ge, and D. Li, “Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing,” in 2022 IEEE International Conference on Cluster Computing (CLUSTER).   IEEE, 2022, pp. 301–312.
  65. H. Oh, J. Lee, H. Kim, and J. Seo, “Out-of-order backprop: An effective scheduling technique for deep learning,” in Proceedings of the Seventeenth European Conference on Computer Systems, 2022, pp. 435–452.
  66. J. M. Tarnawski, D. Narayanan, and A. Phanishayee, “Piper: Multidimensional planner for dnn parallelization,” Advances in Neural Information Processing Systems, vol. 34, pp. 24 829–24 840, 2021.
  67. C. Unger, Z. Jia, W. Wu, S. Lin, M. Baines, C. E. Q. Narvaez, V. Ramakrishnaiah, N. Prajapati, P. McCormick, J. Mohd-Yusof et al., “Unity: Accelerating dnn training through joint optimization of algebraic transformations and parallelization,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 267–284.
  68. J. Zhou, Q. Shi, Y. Ding, L. Wang, L. Li, and F. Zhu, “Anttune: An efficient distributed hyperparameter optimization system for large-scale data,” in International Conference on Database Systems for Advanced Applications.   Springer, 2023, pp. 477–489.
  69. A. V. Joshi and A. V. Joshi, “Amazon’s machine learning toolkit: Sagemaker,” Machine learning and artificial intelligence, pp. 233–243, 2020.
  70. W. Ma, T. Zhou, J. Qin, X. Xiang, Y. Tan, and Z. Cai, “A privacy-preserving content-based image retrieval method based on deep learning in cloud computing,” Expert Systems with Applications, vol. 203, p. 117508, 2022.
  71. F. Desai, D. Chowdhury, R. Kaur, M. Peeters, R. C. Arya, G. S. Wander, S. S. Gill, and R. Buyya, “Healthcloud: A system for monitoring health status of heart patients using machine learning and cloud computing,” Internet of Things, vol. 17, p. 100485, 2022.
  72. X. Wang, L. Zhang, Y. Liu, C. Zhao, and K. Wang, “Solving task scheduling problems in cloud manufacturing via attention mechanism and deep reinforcement learning,” Journal of Manufacturing Systems, vol. 65, pp. 452–468, 2022.
  73. L. Zhang, C. Yang, Y. Yan, and Y. Hu, “Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning,” IEEE Transactions on Industrial Informatics, vol. 18, no. 12, pp. 8999–9007, 2022.
  74. M. Xu, W. C. Ng, W. Y. B. Lim, J. Kang, Z. Xiong, D. Niyato, Q. Yang, X. S. Shen, and C. Miao, “A full dive into realizing the edge-enabled metaverse: Visions, enabling technologies, and challenges,” IEEE Communications Surveys & Tutorials, 2022.
  75. P. Arthurs, L. Gillam, P. Krause, N. Wang, K. Halder, and A. Mouzakitis, “A taxonomy and survey of edge cloud computing for intelligent transportation systems and connected vehicles,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 6206–6221, 2022.
  76. Y. Wu, B. Yang, D. Zhu, Q. Liu, C. Li, C. Chen, and X. Guan, “To transmit or predict: An efficient industrial data transmission scheme with deep learning and cloud-edge collaboration,” IEEE Transactions on Industrial Informatics, vol. 19, no. 11, pp. 11 322–11 332, 2023.
  77. Q. Wu, X. Chen, Z. Zhou, and J. Zhang, “Fedhome: Cloud-edge based personalized federated learning for in-home health monitoring,” IEEE Transactions on Mobile Computing, vol. 21, no. 8, pp. 2818–2832, 2022.
  78. J. A. Alzubi, O. A. Alzubi, A. Singh, and M. Ramachandran, “Cloud-iiot-based electronic health record privacy-preserving by cnn and blockchain-enabled federated learning,” IEEE Transactions on Industrial Informatics, vol. 19, no. 1, pp. 1080–1087, 2023.
  79. D. C. Nguyen, M. Ding, Q.-V. Pham, P. N. Pathirana, L. B. Le, A. Seneviratne, J. Li, D. Niyato, and H. V. Poor, “Federated learning meets blockchain in edge computing: Opportunities and challenges,” IEEE Internet of Things Journal, vol. 8, no. 16, pp. 12 806–12 825, 2021.
  80. Y. Li, X. Tao, X. Zhang, J. Liu, and J. Xu, “Privacy-preserved federated learning for autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8423–8434, 2022.
  81. O. Bouachir, M. Aloqaily, Ö. Özkasap, and F. Ali, “Federatedgrids: Federated learning and blockchain-assisted p2p energy sharing,” IEEE Transactions on Green Communications and Networking, vol. 6, no. 1, pp. 424–436, 2022.
  82. S. Batra, Z. Huang, A. Petrenko, T. Kumar, A. Molchanov, and G. S. Sukhatme, “Decentralized control of quadrotor swarms with end-to-end deep reinforcement learning,” in Proceedings of the 5th Conference on Robot Learning, vol. 164.   PMLR, 08–11 Nov 2022, pp. 576–586.
  83. X. Zhou, W. Liang, K. I.-K. Wang, Z. Yan, L. T. Yang, W. Wei, J. Ma, and Q. Jin, “Decentralized p2p federated learning for privacy-preserving and resilient mobile robotic systems,” IEEE Wireless Communications, vol. 30, no. 2, pp. 82–89, 2023.
  84. D. Liu, X. Chen, Z. Zhou, and Q. Ling, “Hiertrain: Fast hierarchical edge ai learning with hybrid parallelism in mobile-edge-cloud computing,” IEEE Open Journal of the Communications Society, vol. 1, pp. 634–645, 2020.
  85. Z. Chen, L. Shi, X. Liu, J. Li, S. Liu, and Y. Xu, “Osp: Boosting distributed model training with 2-stage synchronization,” in Proceedings of the 52nd International Conference on Parallel Processing, New York, NY, USA, 2023, p. 102–111.
  86. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. a. Ranzato, A. Senior, P. Tucker, K. Yang, Q. Le, and A. Ng, “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, vol. 25, 2012, pp. 1223–1231.
  87. A.-L. Jin, W. Xu, S. Guo, B. Hu, and K. Yeung, “Ps+: A simple yet effective framework for fast training on parameter server,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 12, pp. 4625–4637, 2022.
  88. Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing, “More effective distributed ml via a stale synchronous parallel parameter server,” in Advances in Neural Information Processing Systems, vol. 26, 2013, pp. 1223–1231.
  89. B. McMahan and M. Streeter, “Delay-tolerant algorithms for asynchronous distributed online learning,” in Advances in Neural Information Processing Systems, vol. 27, 2014, pp. 2915–2923.
  90. C. Chen, W. Wang, and B. Li, “Round-robin synchronization: Mitigating communication bottlenecks in parameter servers,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications.   IEEE, 2019, pp. 532–540.
  91. Y. Li, J. Huang, Z. Li, S. Zhou, W. Jiang, and J. Wang, “Hsp: Hybrid synchronous parallelism for fast distributed deep learning,” in Proceedings of the 51st International Conference on Parallel Processing, New York, NY, USA, 2022, pp. 1–11.
  92. S. Zhang, A. E. Choromanska, and Y. LeCun, “Deep learning with elastic averaging sgd,” in Advances in Neural Information Processing Systems, vol. 28, 2015, pp. 685–693.
  93. J. Wang and G. Joshi, “Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd,” in Proceedings of Machine Learning and Systems, vol. 1, 2019, pp. 212–229.
  94. T. Lin, S. U. Stich, K. K. Patel, and M. Jaggi, “Don’t use large mini-batches, use local sgd,” in International Conference on Learning Representations, 2020.
  95. J. Wang, V. Tantia, N. Ballas, and M. Rabbat, “Slowmo: Improving communication-efficient distributed sgd with slow momentum,” in International Conference on Learning Representations, 2020.
  96. T. Chen, G. Giannakis, T. Sun, and W. Yin, “Lag: Lazily aggregated gradient for communication-efficient distributed learning,” Advances in neural information processing systems, vol. 31, p. 5055–5065, 2018.
  97. J. George and P. Gurram, “Distributed stochastic gradient descent with event-triggered communication,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 7169–7178.
  98. Z. Wang, Y. Tu, N. Wang, L. Gao, J. Nie, Z. Wei, Y. Gu, and G. Yu, “Fsp: Towards flexible synchronous parallel frameworks for distributed machine learning,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 2, pp. 687–703, 2023.
  99. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5330–5340.
  100. H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D22{}^{2}start_FLOATSUPERSCRIPT 2 end_FLOATSUPERSCRIPT: Decentralized training over decentralized data,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4848–4856.
  101. H. Sun, Z. Gui, S. Guo, Q. Qi, J. Wang, and J. Liao, “Gssp: Eliminating stragglers through grouping synchronous for distributed deep learning in heterogeneous cluster,” IEEE Transactions on Cloud Computing, vol. 10, no. 4, pp. 2637–2648, 2022.
  102. M. Tan, W.-X. Liu, J. Luo, H. Chen, and Z.-Z. Guo, “Adaptive synchronous strategy for distributed machine learning,” International Journal of Intelligent Systems, vol. 37, no. 12, pp. 11 713–11 741, 2022.
  103. Z. Shen, Q. Tang, T. Zhou, Y. Zhang, Z. Jia, D. Yu, Z. Zhang, and B. Li, “Ashl: An adaptive multi-stage distributed deep learning training scheme for heterogeneous environments,” IEEE Transactions on Computers, pp. 1–14, 2023.
  104. F. Zhou and G. Cong, “On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 3219–3227.
  105. S. U. Stich, “Local sgd converges fast and communicates little,” in ICLR 2019-International Conference on Learning Representations, 2019.
  106. H. Yu, S. Yang, and S. Zhu, “Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning,” ser. AAAI’19/IAAI’19/EAAI’19.   AAAI Press, 2019.
  107. F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, “Local sgd with periodic averaging: Tighter analysis and adaptive synchronization,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 11 082–11 094.
  108. H. Yu and R. Jin, “On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization,” in Proceedings of the 36th International Conference on Machine Learning, vol. 97.   PMLR, 09–15 Jun 2019, pp. 7174–7183.
  109. B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, B. Mcmahan, O. Shamir, and N. Srebro, “Is local SGD better than minibatch SGD?” in Proceedings of the 37th International Conference on Machine Learning, vol. 119.   PMLR, 13–18 Jul 2020, pp. 10 334–10 343.
  110. A. Spiridonoff, A. Olshevsky, and Y. Paschalidis, “Communication-efficient sgd: From local sgd to one-shot averaging,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 24 313–24 326.
  111. J. Wang and G. Joshi, “Cooperative sgd: A unified framework for the design and analysis of local-update sgd algorithms,” vol. 22, no. 1, jan 2021.
  112. S. U. Stich and S. P. Karimireddy, “The error-feedback framework: Better rates for sgd with delayed gradients and compressed updates,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 9613–9648, 2020.
  113. R. Z. Aviv, I. Hakimi, A. Schuster, and K. Y. Levy, “Asynchronous distributed learning : Adapting to gradient delays without prior knowledge,” in Proceedings of the 38th International Conference on Machine Learning, vol. 139.   PMLR, 18–24 Jul 2021, pp. 436–445.
  114. A. Cohen, A. Daniely, Y. Drori, T. Koren, and M. Schain, “Asynchronous stochastic optimization robust to arbitrary delays,” Advances in Neural Information Processing Systems, vol. 34, pp. 9024–9035, 2021.
  115. A. Koloskova, S. U. Stich, and M. Jaggi, “Sharper convergence guarantees for asynchronous sgd for distributed and federated learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 17 202–17 215, 2022.
  116. K. Mishchenko, F. Bach, M. Even, and B. E. Woodworth, “Asynchronous sgd beats minibatch sgd under arbitrary delays,” Advances in Neural Information Processing Systems, vol. 35, pp. 420–433, 2022.
  117. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial intelligence and statistics.   PMLR, 2017, pp. 1273–1282.
  118. P. Zhou, Q. Lin, D. Loghin, B. C. Ooi, Y. Wu, and H. Yu, “Communication-efficient decentralized machine learning over heterogeneous networks,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, pp. 384–395.
  119. C. Chen, H. Xu, W. Wang, B. Li, B. Li, L. Chen, and G. Zhang, “Communication-efficient federated learning with adaptive parameter freezing,” in 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), 2021, pp. 1–11.
  120. ——, “Synchronize only the immature parameters: Communication-efficient federated learning by freezing parameters adaptively,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–18, 2023.
  121. J. Liu, J. Liu, H. Xu, Y. Liao, Z. Wang, and Q. Ma, “Yoga: Adaptive layer-wise model aggregation for decentralized federated learning,” IEEE/ACM Transactions on Networking, 2023.
  122. T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
  123. J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor, “Tackling the objective inconsistency problem in heterogeneous federated optimization,” Advances in neural information processing systems, vol. 33, pp. 7611–7623, 2020.
  124. Y. Esfandiari, S. Y. Tan, Z. Jiang, A. Balu, E. Herron, C. Hegde, and S. Sarkar, “Cross-gradient aggregation for decentralized learning from non-iid data,” in Proceedings of the 38th International Conference on Machine Learning, vol. 139.   PMLR, 18–24 Jul 2021, pp. 3036–3046.
  125. M. Chen, Y. Xu, H. Xu, and L. Huang, “Enhancing decentralized federated learning for non-iid data on heterogeneous devices,” in 2023 IEEE 39th International Conference on Data Engineering (ICDE), 2023, pp. 2289–2302.
  126. K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, and J. Roselander, “Towards federated learning at scale: System design,” in Proceedings of Machine Learning and Systems, vol. 1, 2019, pp. 374–388.
  127. Z. Wang, H. Xu, J. Liu, Y. Xu, H. Huang, and Y. Zhao, “Accelerating federated learning with cluster construction and hierarchical aggregation,” IEEE Transactions on Mobile Computing, vol. 22, no. 7, pp. 3805–3822, 2023.
  128. F. P.-C. Lin, S. Hosseinalipour, S. S. Azam, C. G. Brinton, and N. Michelusi, “Semi-decentralized federated learning with cooperative d2d local model aggregations,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3851–3869, 2021.
  129. M. Ryabinin, E. Gorbunov, V. Plokhotnyuk, and G. Pekhimenko, “Moshpit sgd: Communication-efficient decentralized training on heterogeneous unreliable devices,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 18 195–18 211.
  130. S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
  131. Y. Xu, Y. Liao, H. Xu, Z. Ma, L. Wang, and J. Liu, “Adaptive control of local updating and model compression for efficient federated learning,” IEEE Transactions on Mobile Computing, vol. 22, no. 10, pp. 5675–5689, 2023.
  132. Y. Liao, Y. Xu, H. Xu, Z. Yao, L. Wang, and C. Qiao, “Accelerating federated learning with data and model parallelism in edge computing,” IEEE/ACM Transactions on Networking, pp. 1–15, 2023.
  133. Z. Wang, H. Xu, Y. Xu, Z. Jiang, J. Liu, and S. Chen, “Fast: Enhancing federated learning through adaptive data sampling and local training,” IEEE Transactions on Parallel and Distributed Systems, pp. 1–15, 2023.
  134. J. Park, D. Yoon, S. Yeo, and S. Oh, “Amble: Adjusting mini-batch and local epoch for federated learning with heterogeneous devices,” Journal of Parallel and Distributed Computing, vol. 170, pp. 13–23, 2022.
  135. Y. Liu, L. Xu, X. Yuan, C. Wang, and B. Li, “The right to be forgotten in federated learning: An efficient realization with rapid retraining,” in IEEE INFOCOM 2022 - IEEE Conference on Computer Communications, 2022, pp. 1749–1758.
  136. C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl, “Measuring the effects of data parallelism on neural network training,” Journal of Machine Learning Research, vol. 20, no. 112, pp. 1–49, 2019.
  137. H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition,” in Machine Learning and Knowledge Discovery in Databases, 2016, pp. 795–811.
  138. B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro, “The role of over-parametrization in generalization of neural networks,” in International Conference on Learning Representations, 2019.
  139. K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, New York, NY, USA, 2017, p. 1175–1191.
  140. Z. Yao, A. Gholami, S. Shen, M. Mustafa, K. Keutzer, and M. Mahoney, “Adahessian: An adaptive second order optimizer for machine learning,” in proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, 2021, pp. 10 665–10 673.
  141. F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns,” in Fifteenth annual conference of the international speech communication association, 2014.
  142. J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signsgd: Compressed optimisation for non-convex problems,” in International Conference on Machine Learning.   PMLR, 2018, pp. 560–569.
  143. S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi, “Error feedback fixes signsgd and other gradient compression schemes,” in International Conference on Machine Learning.   PMLR, 2019, pp. 3252–3261.
  144. H. Tang, S. Gan, A. A. Awan, S. Rajbhandari, C. Li, X. Lian, J. Liu, C. Zhang, and Y. He, “1-bit adam: Communication efficient large-scale training with adam’s convergence speed,” in Proceedings of the 38th International Conference on Machine Learning, vol. 139.   PMLR, 18–24 Jul 2021, pp. 10 118–10 129.
  145. S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International conference on machine learning.   PMLR, 2015, pp. 1737–1746.
  146. S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160, 2016.
  147. I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
  148. H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, “Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning,” in International Conference on Machine Learning.   PMLR, 2017, pp. 4035–4043.
  149. S. Horvóth, C.-Y. Ho, L. Horvath, A. N. Sahu, M. Canini, and P. Richtárik, “Natural compression for distributed deep learning,” in Mathematical and Scientific Machine Learning.   PMLR, 2022, pp. 129–141.
  150. A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan, “Distributed mean estimation with limited communication,” in International conference on machine learning.   PMLR, 2017, pp. 3329–3337.
  151. H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, “Atomo: Communication-efficient learning via atomic sparsification,” Advances in neural information processing systems, vol. 31, 2018.
  152. T. Vogels, S. P. Karimireddy, and M. Jaggi, “Powersgd: Practical low-rank gradient compression for distributed optimization,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  153. ——, “Practical low-rank communication compression in decentralized deep learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 171–14 181, 2020.
  154. Y. Dong, L. Wang, J. Wang, X. Hu, H. Zhang, F. R. Yu, and V. C. M. Leung, “Accelerating wireless federated learning via nesterov’s momentum and distributed principle component analysis,” IEEE Transactions on Wireless Communications, pp. 1–1, 2023.
  155. D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization and encoding,” Advances in neural information processing systems, vol. 30, 2017.
  156. J. Konečnỳ and P. Richtárik, “Randomized distributed mean estimation: Accuracy vs. communication,” Frontiers in Applied Mathematics and Statistics, vol. 4, p. 62, 2018.
  157. W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “Terngrad: Ternary gradients to reduce communication in distributed deep learning,” Advances in neural information processing systems, vol. 30, 2017.
  158. J. Wu, W. Huang, J. Huang, and T. Zhang, “Error compensated quantized sgd and its applications to large-scale distributed optimization,” in International Conference on Machine Learning.   PMLR, 2018, pp. 5325–5333.
  159. A. Ramezani-Kebrya, F. Faghri, I. Markov, V. Aksenov, D. Alistarh, and D. M. Roy, “Nuqsgd: Provably communication-efficient data-parallel sgd via nonuniform quantization,” Journal of Machine Learning Research, vol. 22, no. 1, jan 2021.
  160. F. Faghri, I. Tabrizian, I. Markov, D. Alistarh, D. M. Roy, and A. Ramezani-Kebrya, “Adaptive gradient quantization for data-parallel sgd,” Advances in neural information processing systems, vol. 33, pp. 3174–3185, 2020.
  161. K. Mishchenko, B. Wang, D. Kovalev, and P. Richtárik, “IntSGD: Adaptive floatless compression of stochastic gradients,” in International Conference on Learning Representations, 2022.
  162. Y. Mao, Z. Zhao, G. Yan, Y. Liu, T. Lan, L. Song, and W. Ding, “Communication-efficient federated learning with adaptive quantization,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 13, no. 4, pp. 1–26, 2022.
  163. H. Liu, F. He, and G. Cao, “Communication-efficient federated learning for heterogeneous edge devices based on adaptive gradient quantization,” in IEEE INFOCOM 2023-IEEE Conference on Computer Communications.   IEEE, 2023, pp. 1–10.
  164. N. Ström, “Scalable distributed dnn training using commodity gpu cloud computing,” in Interspeech 2015, 2015.
  165. A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
  166. S. U. Stich, J.-B. Cordonnier, and M. Jaggi, “Sparsified sgd with memory,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  167. Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” in International Conference on Learning Representations, 2018.
  168. D. Xiao, Y. Mei, D. Kuang, M. Chen, B. Guo, and W. Wu, “Egc: Entropy-based gradient compression for distributed deep learning,” Information Sciences, vol. 548, pp. 118–134, 2021.
  169. Z. Zhang and C. Wang, “Mipd: An adaptive gradient sparsification framework for distributed dnns training,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 11, pp. 3053–3066, 2022.
  170. S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu, “A distributed synchronous sgd algorithm with global top-k sparsification for low bandwidth networks,” in 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019, pp. 2238–2247.
  171. C.-Y. Chen, J. Ni, S. Lu, X. Cui, P.-Y. Chen, X. Sun, N. Wang, S. Venkataramani, V. V. Srinivasan, W. Zhang et al., “Scalecom: Scalable sparsified gradient compression for communication-efficient distributed training,” Advances in Neural Information Processing Systems, vol. 33, pp. 13 551–13 563, 2020.
  172. A. M Abdelmoniem, A. Elzanaty, M.-S. Alouini, and M. Canini, “An efficient statistical-based gradient compression technique for distributed training systems,” Proceedings of Machine Learning and Systems, vol. 3, pp. 297–322, 2021.
  173. S. Shi, X. Zhou, S. Song, X. Wang, Z. Zhu, X. Huang, X. Jiang, F. Zhou, Z. Guo, L. Xie et al., “Towards scalable distributed training of deep learning on public cloud clusters,” Proceedings of Machine Learning and Systems, vol. 3, pp. 401–412, 2021.
  174. S. Li and T. Hoefler, “Near-optimal sparse allreduce for distributed deep learning,” in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2022, pp. 135–149.
  175. R. Liu and B. Mozafari, “Communication-efficient distributed learning for large batch optimization,” in International Conference on Machine Learning.   PMLR, 2022, pp. 13 925–13 946.
  176. J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  177. D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, “The convergence of sparsified gradient methods,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  178. L. Wang, W. Wu, J. Zhang, H. Liu, G. Bosilca, M. Herlihy, and R. Fonseca, “Fft-based gradient sparsification for the distributed training of deep neural networks,” in Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, 2020, pp. 113–124.
  179. A. Sahu, A. Dutta, A. M. Abdelmoniem, T. Banerjee, M. Canini, and P. Kalnis, “Rethinking gradient sparsification as total error minimization,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 8133–8146.
  180. S. Shi, Q. Wang, X. Chu, B. Li, Y. Qin, R. Liu, and X. Zhao, “Communication-efficient distributed deep learning with merged gradient sparsification on gpus,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 406–415.
  181. Z. Wang, Z. Xu, X. Wu, A. Shrivastava, and T. S. E. Ng, “DRAGONN: Distributed randomized approximate gradients of neural networks,” in Proceedings of the 39th International Conference on Machine Learning, vol. 162.   PMLR, 17–23 Jul 2022, pp. 23 274–23 291.
  182. F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Robust and communication-efficient federated learning from non-i.i.d. data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9, pp. 3400–3413, 2020.
  183. Z. Tang, S. Shi, B. Li, and X. Chu, “Gossipfl: A decentralized federated learning framework with sparsified and adaptive communication,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 3, pp. 909–922, 2022.
  184. L. Yi, W. Gang, and L. Xiaoguang, “QSFL: A two-level uplink communication optimization framework for federated learning,” in Proceedings of the 39th International Conference on Machine Learning, vol. 162.   PMLR, 17–23 Jul 2022, pp. 25 501–25 513.
  185. P. Han, S. Wang, and K. K. Leung, “Adaptive gradient sparsification for efficient federated learning: An online learning approach,” in 2020 IEEE 40th international conference on distributed computing systems (ICDCS).   IEEE, 2020, pp. 300–310.
  186. Z. Feng, X. Chen, Q. Wu, W. Wu, X. Zhang, and Q. Huang, “Feddd: Toward communication-efficient federated learning with differential parameter dropout,” IEEE Transactions on Mobile Computing, pp. 1–18, 2023.
  187. J. Fang, H. Fu, G. Yang, and C.-J. Hsieh, “Redsync: reducing synchronization bandwidth for distributed deep learning training system,” Journal of Parallel and Distributed Computing, vol. 133, pp. 30–39, 2019.
  188. P. Jiang and G. Agrawal, “A linear speedup analysis of distributed deep learning with sparse and quantized communication,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  189. D. Basu, D. Data, C. Karakus, and S. Diggavi, “Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  190. G. Yan, T. Li, S.-L. Huang, T. Lan, and L. Song, “Ac-sgd: Adaptively compressed sgd for communication-efficient distributed learning,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 9, pp. 2678–2693, 2022.
  191. H. Lim, D. G. Andersen, and M. Kaminsky, “3lc: Lightweight and effective traffic compression for distributed machine learning,” Proceedings of Machine Learning and Systems, vol. 1, pp. 53–64, 2019.
  192. C. Yang, Y. Zhao, G. Zhao, and H. Xu, “Dfs: Joint data formatting and sparsification for efficient communication in distributed machine learning,” Computer Networks, vol. 229, p. 109777, 2023.
  193. R. Song, L. Zhou, L. Lyu, A. Festag, and A. Knoll, “Resfed: Communication efficient federated learning with deep compressed residuals,” IEEE Internet of Things Journal, pp. 1–1, 2023.
  194. L. Abrahamyan, Y. Chen, G. Bekoulis, and N. Deligiannis, “Learned gradient compression for distributed deep learning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 7330–7344, 2021.
  195. Z. Wang, M. Wen, Y. Xu, Y. Zhou, J. H. Wang, and L. Zhang, “Communication compression techniques in distributed deep learning: A survey,” Journal of Systems Architecture, p. 102927, 2023.
  196. R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8-bit training of neural networks,” in Advances in Neural Information Processing Systems, vol. 31.   Curran Associates, Inc., 2018.
  197. G. Stewart and J. Miller, “Methods of simultaneous iteration for calculating eigenvectors of matrices,” Topics in Numerical Analysis II, vol. 2, 1975.
  198. S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, “Characterizing implicit bias in terms of optimization geometry,” in International Conference on Machine Learning.   PMLR, 2018, pp. 1832–1841.
  199. R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in International conference on machine learning.   Pmlr, 2013, pp. 1310–1318.
  200. I. Mitliagkas, C. Zhang, S. Hadjis, and C. Ré, “Asynchrony begets momentum, with an application to deep learning,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).   IEEE, 2016, pp. 997–1004.
  201. P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  202. Y. You, I. Gitman, and B. Ginsburg, “Scaling sgd batch size to 32k for imagenet training,” arXiv preprint arXiv:1708.03888, vol. 6, no. 12, p. 6, 2017.
  203. Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for deep learning: Training bert in 76 minutes,” arXiv preprint arXiv:1904.00962, 2019.
  204. A. H. Robinson and C. Cherry, “Results of a prototype television bandwidth compression scheme,” Proceedings of the IEEE, vol. 55, no. 3, pp. 356–364, 1967.
  205. W. Xiao, R. Bhardwaj, R. Ramjee, M. Sivathanu, N. Kwatra, Z. Han, P. Patel, X. Peng, H. Zhao, Q. Zhang et al., “Gandiva: Introspective cluster scheduling for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 595–610.
  206. W. Xiao, S. Ren, Y. Li, Y. Zhang, P. Hou, Z. Li, Y. Feng, W. Lin, and Y. Jia, “Antman: Dynamic scaling on gpu clusters for deep learning,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 533–548.
  207. Q. Weng, L. Yang, Y. Yu, W. Wang, X. Tang, G. Yang, and L. Zhang, “Beware of fragmentation: Scheduling gpu-sharing workloads with fragmentation gradient descent,” in 2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 995–1008.
  208. B. Wu, Z. Zhang, Z. Bai, X. Liu, and X. Jin, “Transparent gpu sharing in container clouds for deep learning workloads,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 69–85.
  209. P. Yu and M. Chowdhury, “Salus: Fine-grained gpu sharing primitives for deep learning applications,” in Proceedings of the 3rd MLSys Conference, 2020, pp. 1–14.
  210. Z. Bai, Z. Zhang, Y. Zhu, and X. Jin, “Pipeswitch: Fast pipelined context switching for deep learning applications,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 499–514.
  211. Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo, “Optimus: an efficient dynamic resource scheduler for deep learning clusters,” in Proceedings of the Thirteenth EuroSys Conference, 2018, pp. 1–14.
  212. Y. Bao, Y. Peng, and C. Wu, “Deep learning-based job placement in distributed machine learning clusters,” in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 2019, pp. 505–513.
  213. G. Yeung, D. Borowiec, R. Yang, A. Friday, R. Harper, and P. Garraghan, “Horus: Interference-aware and prediction-based scheduling in deep learning systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 1, pp. 88–100, 2021.
  214. A. Qiao, S. K. Choe, S. J. Subramanya, W. Neiswanger, Q. Ho, H. Zhang, G. R. Ganger, and E. P. Xing, “Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning,” in 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21), Jul. 2021, pp. 1–18.
  215. G. Lim, J. Ahn, W. Xiao, Y. Kwon, and M. Jeon, “Zico: Efficient GPU memory sharing for concurrent DNN training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), Jul. 2021, pp. 161–175.
  216. C. Hwang, T. Kim, S. Kim, J. Shin, and K. Park, “Elastic resource sharing for distributed deep learning,” in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021, pp. 721–739.
  217. Y. Mao, V. Sharma, W. Zheng, L. Cheng, Q. Guan, and A. Li, “Elastic resource management for deep learning applications in a container cluster,” IEEE Transactions on Cloud Computing, vol. 11, no. 2, pp. 2204–2216, 2023.
  218. P. Yu, J. Liu, and M. Chowdhury, “Fluid: Resource-aware hyperparameter tuning engine,” in Proceedings of Machine Learning and Systems, vol. 3, 2021, pp. 502–516.
  219. W. Gao, P. Sun, Y. Wen, and T. Zhang, “Titan: a scheduler for foundation model fine-tuning workloads,” in Proceedings of the 13th Symposium on Cloud Computing, 2022, pp. 348–354.
  220. L. Liu, J. Yu, and Z. Ding, “Adaptive and efficient gpu time sharing for hyperparameter tuning in cloud,” in Proceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11.
  221. Q. Hu, Z. Ye, M. Zhang, Q. Chen, P. Sun, Y. Wen, and T. Zhang, “Hydro: Surrogate-based hyperparameter tuning service in datacenters,” in 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), 2023, pp. 757–777.
  222. R. Gu, Y. Chen, S. Liu, H. Dai, G. Chen, K. Zhang, Y. Che, and Y. Huang, “Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed gpu clusters,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 11, pp. 2808–2820, 2022.
  223. Z. Zhang, Q. Qi, R. Shang, L. Chen, and F. Xu, “Prophet: Speeding up distributed dnn training with predictable communication scheduling,” in Proceedings of the 50th International Conference on Parallel Processing, 2021, pp. 1–11.
  224. W. Li, S. Chen, K. Li, H. Qi, R. Xu, and S. Zhang, “Efficient online scheduling for coflow-aware machine learning clusters,” IEEE Transactions on Cloud Computing, vol. 10, no. 4, pp. 2564–2579, 2020.
  225. A. Dhakal, S. G. Kulkarni, and K. Ramakrishnan, “Gslice: controlled spatial sharing of gpus for a scalable inference platform,” in Proceedings of the 11th ACM Symposium on Cloud Computing, 2020, pp. 492–506.
  226. F. Xu, J. Xu, J. Chen, L. Chen, R. Shang, Z. Zhou, and F. Liu, “igniter: Interference-aware gpu resource provisioning for predictable dnn inference in the cloud,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 3, pp. 812–827, 2022.
  227. J. Cho, D. Zad Tootaghaj, L. Cao, and P. Sharma, “Sla-driven ml inference framework for clouds with heterogeneous accelerators,” Proceedings of Machine Learning and Systems, vol. 4, pp. 20–32, 2022.
  228. H. Shen, L. Chen, Y. Jin, L. Zhao, B. Kong, M. Philipose, A. Krishnamurthy, and R. Sundaram, “Nexus: A gpu cluster engine for accelerating dnn-based video analysis,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 322–337.
  229. F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “Infaas: Automated model-less inference serving,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 397–411.
  230. J. R. Gunasekaran, C. S. Mishra, P. Thinakaran, B. Sharma, M. T. Kandemir, and C. R. Das, “Cocktail: A multidimensional optimization for model serving in cloud,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 1041–1057.
  231. S. Choi, S. Lee, Y. Kim, J. Park, Y. Kwon, and J. Huh, “Serving heterogeneous machine learning models on multi-gpu servers with spatio-temporal sharing,” in 2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 199–216.
  232. J. Gu, Y. Zhu, P. Wang, M. Chadha, and M. Gerndt, “Fast-gshare: Enabling efficient spatio-temporal gpu sharing in serverless computing for deep learning inference,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 635–644.
  233. D. Narayanan, K. Santhanam, F. Kazhamiaka, A. Phanishayee, and M. Zaharia, “Heterogeneity-aware cluster scheduling policies for deep learning workloads,” in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), 2020, pp. 481–498.
  234. Q. Hu, P. Sun, S. Yan, Y. Wen, and T. Zhang, “Characterization and prediction of deep learning workloads in large-scale gpu datacenters,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
  235. Q. Weng, W. Xiao, Y. Yu, W. Wang, C. Wang, J. He, Y. Li, L. Zhang, W. Lin, and Y. Ding, “Mlaas in the wild: Workload analysis and scheduling in large-scale heterogeneous gpu clusters,” in 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022, pp. 945–960.
  236. J. Li, H. Xu, Y. Zhu, Z. Liu, C. Guo, and C. Wang, “Lyra: Elastic scheduling for deep learning clusters,” in Proceedings of the Eighteenth European Conference on Computer Systems, 2023, pp. 835–850.
  237. R. Cheng, C. Cai, S. Yilmaz, R. Mitra, M. Bag, M. Ghosh, and T. Xu, “Towards gpu memory efficiency for distributed training at scale,” in Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023, pp. 281–297.
  238. Z. Ye, W. Gao, Q. Hu, P. Sun, X. Wang, Y. Luo, T. Zhang, and Y. Wen, “Deep learning workload scheduling in gpu datacenters: A survey,” ACM Computing Surveys, vol. 56, no. 6, pp. 1–38, 2024.
  239. “Nvidia multi-process service,” 2024. [Online]. Available: https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
  240. “Nvidia multi-instance gpu.” 2024. [Online]. Available: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/
  241. “Unified memory for cuda,” 2024. [Online]. Available: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
  242. M. Amaral, J. Polo, D. Carrera, S. Seelam, and M. Steinder, “Topology-aware gpu scheduling for learning workloads in cloud environments,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–12.
  243. J. Gu, M. Chowdhury, K. Shin, Y. Zhu, M. Jeon, J. Qian, H. Liu, and C. Guo, “Tiresias: A GPU cluster manager for distributed deep learning,” Networked Systems Design and Implementation,Networked Systems Design and Implementation, pp. 485–500, Jan 2019.
  244. K. R. Jayaram, V. Muthusamy, P. Dube, V. Ishakian, C. Wang, B. Herta, S. Boag, D. Arroyo, A. Tantawi, A. Verma, F. Pollok, and R. Khalaf, “Ffdl: A flexible multi-tenant deep learning platform.” in Proceedings of the 20th International Middleware Conference, Dec 2019, pp. 82–95.
  245. M. Jeon, S. Venkataraman, A. Phanishayee, J. Qian, W. Xiao, and F. Yang, “Analysis of large-scale multi-tenant gpu clusters for dnn training workloads,” in 2019 USENIX Annual Technical Conference (USENIX ATC 19), 2019, pp. 947–960.
  246. A. Sultana, L. Chen, F. Xu, and X. Yuan, “E-las: Design and analysis of completion-time agnostic scheduling for distributed deep learning cluster,” in 49th International Conference on Parallel Processing - ICPP, Aug 2020, pp. 1–11.
  247. M. Yu, C. Wu, B. Ji, and J. Liu, “A sum-of-ratios multi-dimensional-knapsack decomposition for dnn resource scheduling,” in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, May 2021, pp. 1–10.
  248. C. Wang, N. Yoshikane, F. Balasis, and T. Tsuritani, “Osdl: Dedicated optical slice provisioning in support of distributed deep learning,” Computer Networks, vol. 214, p. 109191, 2022.
  249. Y. Luan, X. Chen, H. Zhao, Z. Yang, and Y. Dai, “Sched²: Scheduling deep learning training via deep reinforcement learning,” in 2019 IEEE Global Communications Conference (GLOBECOM).   IEEE, 2019, pp. 1–7.
  250. H. Wang, Z. Liu, and H. Shen, “Job scheduling for large-scale machine learning clusters,” in Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies, Nov 2020, pp. 108–120.
  251. D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, “Pipedream: Generalized pipeline parallelism for dnn training,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, 2019, pp. 1–15.
  252. Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in neural information processing systems, vol. 32, 2019.
  253. L. Zhang, S. Shi, X. Chu, W. Wang, B. Li, and C. Liu, “Dear: Accelerating distributed deep learning with fine-grained all-reduce pipelining,” in 2023 IEEE 43rd International Conference on Distributed Computing Systems (ICDCS).   IEEE, 2023, pp. 142–153.
  254. J. H. Park, G. Yun, M. Y. Chang, N. T. Nguyen, S. Lee, J. Choi, S. H. Noh, and Y.-r. Choi, “Hetpipe: Enabling large dnn training on (whimpy) heterogeneous gpu clusters through integration of pipelined model parallelism and data parallelism,” in 2020 USENIX Annual Technical Conference (USENIX ATC 20), 2020, pp. 307–321.
  255. P. Zhou, X. He, S. Luo, H. Yu, and G. Sun, “Jpas: Job-progress-aware flow scheduling for deep learning clusters,” Journal of Network and Computer Applications, p. 102590, May 2020.
  256. S. Wang, D. Li, and J. Geng, “Geryon: Accelerating distributed cnn training by network-level flow scheduling,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 1678–1687.
  257. M. Kang, G. Yang, Y. Yoo, and C. Yoo, “Tensorexpress: In-network communication scheduling for distributed deep learning,” in 2020 IEEE 13th international conference on cloud computing (CLOUD).   IEEE, 2020, pp. 25–27.
  258. Y. He, W. Cai, P. Zhou, G. Sun, S. Luo, H. Yu, and M. Guizani, “Beamer: Stage-aware coflow scheduling to accelerate hyper-parameter tuning in deep learning clusters,” IEEE Transactions on Network and Service Management, vol. 19, no. 2, pp. 1083–1097, 2021.
  259. C. Chen, S. Wang, Y. Chen, and J. Han, “Tereis: A package-based scheduling in deep learning systems,” in 2022 IEEE 28th International Conference on Parallel and Distributed Systems (ICPADS).   IEEE, 2023, pp. 867–874.
  260. Q. Duan, C. Peng, Z. Wang, Y. Xu, S. Liu, J. Wu, and J. C. Lui, “Accelerating distributed dnn training via transport layer scheduling,” IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 5, pp. 1650–1666, 2023.
  261. H. Zheng, F. Xu, L. Chen, Z. Zhou, and F. Liu, “Cynthia: Cost-efficient cloud resource provisioning for predictable distributed deep neural network training,” in Proceedings of the 48th International Conference on Parallel Processing, 2019, pp. 1–11.
  262. N. B. D. Ta, “fc²: Cloud-based cluster provisioning for distributed machine learning,” Cluster Computing, pp. 1299–1315, Dec 2019.
  263. A. Jahani, M. Lattuada, M. Ciavotta, D. Ardagna, E. Amaldi, and L. Zhang, “Optimizing on-demand gpus in the cloud for deep learning applications training,” in 2019 4th International Conference on Computing, Communications and Security (ICCCS), Oct 2019.
  264. F. Wang, W. Zhang, S. Lai, M. Hao, and Z. Wang, “Dynamic gpu energy optimization for machine learning training workloads,” IEEE Transactions on Parallel and Distributed Systems, p. 1–1, Jan 2022.
  265. F. Filippini, J. Anselmi, D. Ardagna, and B. Gaujal, “A stochastic approach for scheduling ai training jobs in gpu-based systems,” IEEE Transactions on Cloud Computing, pp. 1–17, 2023.
  266. Z. Chen, W. Quan, M. Wen, J. Fang, J. Yu, C. Zhang, and L. Luo, “Deep learning research and development platform: Characterizing and scheduling with qos guarantees on gpu clusters,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 1, pp. 34–50, 2020.
  267. W. Gao, Z. Ye, P. Sun, Y. Wen, and T. Zhang, “Chronus: A novel deadline-aware scheduler for deep learning training jobs,” in Proceedings of the ACM Symposium on Cloud Computing, 2021, pp. 609–623.
  268. Z. Yang, H. Wu, Y. Xu, Y. Wu, H. Zhong, and W. Zhang, “Hydra: Deadline-aware and efficiency-oriented scheduling for deep learning jobs on heterogeneous gpus,” IEEE Transactions on Computers, vol. 72, no. 8, pp. 2224–2236, 2023.
  269. W. Liu, J. Geng, Z. Zhu, J. Cao, and Z. Lian, “Sniper: Cloud-edge collaborative inference scheduling with neural network similarity modeling,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, ser. DAC ’22, New York, NY, USA, 2022, p. 505–510.
  270. Y. Li, Z. Han, Q. Zhang, Z. Li, and H. Tan, “Automating cloud deployment for deep learning inference of real-time online services,” in IEEE INFOCOM 2020-IEEE Conference on Computer Communications.   IEEE, 2020, pp. 1668–1677.
  271. D. Crankshaw, X. Wang, G. Zhou, M. J. Franklin, J. E. Gonzalez, and I. Stoica, “Clipper: A low-latency online prediction serving system,” in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), 2017, pp. 613–627.
  272. W. Wang, J. Gao, M. Zhang, S. Wang, G. Chen, T. K. Ng, B. C. Ooi, J. Shao, and M. Reyad, “Rafiki: Machine learning as an analytics service system,” Proceedings of the VLDB Endowment, pp. 128–140, Oct 2018.
  273. X. Tang, P. Wang, Q. Liu, W. Wang, and J. Han, “Nanily: A qos-aware scheduling for dnn inference workload in clouds,” in 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Aug 2019.
  274. H. Qin, S. Zawad, Y. Zhou, L. Yang, D. Zhao, and F. Yan, “Swift machine learning model serving scheduling: a region based reinforcement learning approach,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–23.
  275. L. Wang, L. Yang, Y. Yu, W. Wang, B. Li, X. Sun, J. He, and L. Zhang, “Morphling: fast, near-optimal auto-configuration for cloud-native model serving,” in Proceedings of the ACM Symposium on Cloud Computing, 2021, pp. 639–653.
  276. M. Chowdhury and I. Stoica, “Efficient coflow scheduling without prior knowledge,” ACM SIGCOMM Computer Communication Review, vol. 45, no. 4, pp. 393–406, 2015.
  277. S. Tang, Y. Yu, H. Wang, G. Wang, W. Chen, Z. Xu, S. Guo, and W. Gao, “A survey on scheduling techniques in computing and network convergence,” IEEE Communications Surveys & Tutorials, 2023.
  278. A. Nukada, “Performance optimization of allreduce operation for multi-gpu systems,” in 2021 IEEE International Conference on Big Data (Big Data).   IEEE, 2021, pp. 3107–3112.
  279. “Nvlink and nvswitch,” 2024. [Online]. Available: https://www.nvidia.com/en-us/data-center/nvlink/
  280. “Gpudirect rdma,” 2024. [Online]. Available: https://developer.nvidia.com/gpudirect
  281. A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. Ports, and P. Richtárik, “Scaling distributed machine learning with in-network aggregation,” in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021, pp. 785–808.
  282. C. Lao, Y. Le, K. Mahajan, Y. Chen, W. Wu, A. Akella, and M. Swift, “Atp: In-network aggregation for multi-tenant learning,” in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), 2021, pp. 741–761.
  283. B. Zhao, C. Liu, J. Dong, Z. Cao, W. Nie, and W. Wu, “Enabling switch memory management for distributed training with in-network aggregation,” in IEEE INFOCOM 2023-IEEE Conference on Computer Communications.   IEEE, 2023, pp. 1–10.
  284. J. Fang, G. Zhao, H. Xu, C. Wu, and Z. Yu, “Grid: Gradient routing with in-network aggregation for distributed training,” IEEE/ACM Transactions on Networking, 2023.
  285. J. Fang, G. Zhao, H. Xu, Z. Yu, B. Shen, and L. Xie, “Goat: Gradient scheduling with collaborative in-network aggregation for distributed training,” in 2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS).   IEEE, 2023, pp. 1–10.
  286. N. Gebara, M. Ghobadi, and P. Costa, “In-network aggregation for shared machine learning clusters,” Proceedings of Machine Learning and Systems, vol. 3, pp. 829–844, 2021.
  287. Z. Li, J. Huang, Y. Li, A. Xu, S. Zhou, J. Liu, and J. Wang, “A2tp: Aggregator-aware in-network aggregation for multi-tenant learning,” in Proceedings of the Eighteenth European Conference on Computer Systems, 2023, pp. 639–653.
  288. S. Liu, Q. Wang, J. Zhang, W. Wu, Q. Lin, Y. Liu, M. Xu, M. Canini, R. C. Cheung, and J. He, “In-network aggregation with transport transparency for distributed training,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2023, pp. 376–391.
  289. A. Guo, Y. Hao, C. Wu, P. Haghi, Z. Pan, M. Si, D. Tao, A. Li, M. Herbordt, and T. Geng, “Software-hardware co-design of heterogeneous smartnic system for recommendation models inference and training,” in Proceedings of the 37th International Conference on Supercomputing, 2023, pp. 336–347.
  290. A. Guo, T. Geng, Y. Zhang, P. Haghi, C. Wu, C. Tan, Y. Lin, A. Li, and M. Herbordt, “A framework for neural network inference on fpga-centric smartnics,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL).   IEEE, 2022, pp. 01–08.
  291. “Open mpi,” 2024. [Online]. Available: https://www.open-mpi.org/
  292. “Facebook gloo,” 2024. [Online]. Available: https://github.com/facebookincubator/gloo
  293. “Horovod,” 2024. [Online]. Available: https://horovod.readthedocs.io/en/stable/
  294. “Nccl,” 2024. [Online]. Available: https://developer.nvidia.com/nccl
  295. M. Cowan, S. Maleki, M. Musuvathi, O. Saarikivi, and Y. Xiong, “Mscclang: Microsoft collective communication language,” in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 502–514.
  296. M. Cho, U. Finkler, D. Kung, and H. Hunter, “Blueconnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy,” Proceedings of Machine Learning and Systems, vol. 1, pp. 241–251, 2019.
  297. G. Wang, S. Venkataraman, A. Phanishayee, N. Devanur, J. Thelin, and I. Stoica, “Blink: Fast and generic collectives for distributed ml,” in Proceedings of Machine Learning and Systems, vol. 2, 2020, pp. 172–186.
  298. L. Luo, P. West, J. Nelson, A. Krishnamurthy, and L. Ceze, “Plink: Discovering and exploiting locality for accelerated distributed training on the public cloud,” Proceedings of Machine Learning and Systems, vol. 2, pp. 82–97, 2020.
  299. C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, “Sparcml: High-performance sparse communication for machine learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–15.
  300. J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, and A. Sapio, “Efficient sparse collective communication and its application to accelerate distributed deep learning,” in Proceedings of the 2021 ACM SIGCOMM 2021 Conference, 2021, pp. 676–691.
  301. H. Xu, K. Kostopoulou, A. Dutta, X. Li, A. Ntoulas, and P. Kalnis, “Deepreduce: A sparse-tensor communication framework for federated deep learning,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 21150–21163.
  302. Z. Cai, Z. Liu, S. Maleki, M. Musuvathi, T. Mytkowicz, J. Nelson, and O. Saarikivi, “Synthesizing optimal collective algorithms,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 62–75.
  303. A. Shah, V. Chidambaram, M. Cowan, S. Maleki, M. Musuvathi, T. Mytkowicz, J. Nelson, O. Saarikivi, and R. Singh, “TACCL: Guiding collective algorithm synthesis using communication sketches,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23).   Boston, MA: USENIX Association, Apr. 2023, pp. 593–612.
  304. S. Wang, D. Li, Y. Cheng, J. Geng, Y. Wang, S. Wang, S. Xia, and J. Wu, “A scalable, high-performance, and fault-tolerant network architecture for distributed machine learning,” IEEE/ACM Transactions on Networking, vol. 28, no. 4, pp. 1752–1764, 2020.
  305. T. Hoefler, T. Bonato, D. De Sensi, S. Di Girolamo, S. Li, M. Heddes, J. Belk, D. Goel, M. Castro, and S. Scott, “Hammingmesh: A network topology for large-scale deep learning,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2022, pp. 1–18.
  306. M. Khani, M. Ghobadi, M. Alizadeh, Z. Zhu, M. Glick, K. Bergman, A. Vahdat, B. Klenk, and E. Ebrahimi, “Sip-ml: high-bandwidth optical network interconnects for machine learning training,” in Proceedings of the 2021 ACM SIGCOMM 2021 Conference, 2021, pp. 657–675.
  307. W. Wang, M. Khazraee, Z. Zhong, M. Ghobadi, Z. Jia, D. Mudigere, Y. Zhang, and A. Kewitsch, “Topoopt: Co-optimizing network topology and parallelization strategy for distributed training jobs,” in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 739–767.
  308. N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken et al., “Blue gene/l torus interconnection network,” IBM Journal of Research and Development, vol. 49, no. 2.3, pp. 265–276, 2005.
  309. A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. R. Tallent, and K. J. Barker, “Evaluating modern gpu interconnect: Pcie, nvlink, nv-sli, nvswitch and gpudirect,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 1, pp. 94–110, 2019.
  310. S. M. Nabavinejad, M. Baharloo, K.-C. Chen, M. Palesi, T. Kogel, and M. Ebrahimi, “An overview of efficient interconnection networks for deep neural network accelerators,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 10, no. 3, pp. 268–282, 2020.
  311. A. Feng, D. Dong, F. Lei, J. Ma, E. Yu, and R. Wang, “In-network aggregation for data center networks: A survey,” Computer Communications, vol. 198, pp. 63–76, 2023.
  312. M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “Data center tcp (dctcp),” in Proceedings of the ACM SIGCOMM 2010 Conference, 2010, pp. 63–74.
  313. M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, “pfabric: Minimal near-optimal datacenter transport,” ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 435–446, 2013.
  314. N. Schelten, F. Steinert, A. Schulte, and B. Stabernack, “A high-throughput, resource-efficient implementation of the rocev2 remote dma protocol for network-attached hardware accelerators,” in 2020 International Conference on Field-Programmable Technology (ICFPT).   IEEE, 2020, pp. 241–249.
  315. “Nvidia connectx smartnic,” 2024. [Online]. Available: https://www.nvidia.com/en-us/networking/ethernet-adapters/
  316. “Intel ipu,” 2024. [Online]. Available: https://www.intel.com/content/www/us/en/products/details/network-io/ipu.html
  317. L. Liu, P. Zhou, G. Sun, X. Chen, T. Wu, H. Yu, and M. Guizani, “Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions,” Neurocomputing, p. 127009, 2023.
  318. M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” ACM SIGCOMM computer communication review, vol. 38, no. 4, pp. 63–74, 2008.
  319. C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, “Bcube: a high performance, server-centric network architecture for modular data centers,” in Proceedings of the ACM SIGCOMM 2009 conference on Data communication, 2009, pp. 63–74.
  320. A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: Networking data centers randomly,” in 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 2012, pp. 225–238.
  321. C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “Dcell: a scalable and fault-tolerant network structure for data centers,” in Proceedings of the ACM SIGCOMM 2008 conference on Data communication, 2008, pp. 75–86.
  322. N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat, “Helios: a hybrid electrical/optical switch architecture for modular data centers,” in Proceedings of the ACM SIGCOMM 2010 Conference, 2010, pp. 339–350.
  323. G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. E. Ng, M. Kozuch, and M. Ryan, “c-through: Part-time optics in data centers,” in Proceedings of the ACM SIGCOMM 2010 Conference, 2010, pp. 327–338.
  324. K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen, “Osa: An optical switching architecture for data center networks with unprecedented flexibility,” IEEE/ACM Transactions on networking, vol. 22, no. 2, pp. 498–511, 2013.
  325. H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  326. H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  327. A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  328. E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” in The Eleventh International Conference on Learning Representations, 2023.
  329. S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
  330. D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15.
  331. B. Yuan, Y. He, J. Davis, T. Zhang, T. Dao, B. Chen, P. S. Liang, C. Re, and C. Zhang, “Decentralized training of foundation models in heterogeneous environments,” Advances in Neural Information Processing Systems, vol. 35, pp. 25464–25477, 2022.
  332. S. Athlur, N. Saran, M. Sivathanu, R. Ramjee, and N. Kwatra, “Varuna: Scalable, low-cost training of massive deep learning models,” in Proceedings of the Seventeenth European Conference on Computer Systems, New York, NY, USA, 2022, p. 472–487.
  333. S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., “Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model,” arXiv preprint arXiv:2201.11990, 2022.
  334. J. Wang, Y. Lu, B. Yuan, B. Chen, P. Liang, C. De Sa, C. Re, and C. Zhang, “CocktailSGD: Fine-tuning foundation models over 500Mbps networks,” in Proceedings of the 40th International Conference on Machine Learning, vol. 202.   PMLR, 23–29 Jul 2023, pp. 36058–36076.
  335. M. Ryabinin, T. Dettmers, M. Diskin, and A. Borzunov, “Swarm parallelism: Training large models can be surprisingly communication-efficient,” in Proceedings of the 40th International Conference on Machine Learning.   JMLR, 2023.
  336. I. Jang, Z. Yang, Z. Zhang, X. Jin, and M. Chowdhury, “Oobleck: Resilient distributed training of large models using pipeline templates,” in Proceedings of the 29th Symposium on Operating Systems Principles, New York, NY, USA, 2023, p. 382–395.
  337. Y. Shen, J. Shao, X. Zhang, Z. Lin, H. Pan, D. Li, J. Zhang, and K. B. Letaief, “Large language models empowered autonomous edge ai for connected intelligence,” IEEE Communications Magazine, pp. 1–7, 2024.
Authors (6)
  1. Feng Liang
  2. Zhen Zhang
  3. Haifeng Lu
  4. Victor C. M. Leung
  5. Yanyi Guo
  6. Xiping Hu