Accelerating Geo-distributed Machine Learning with Network-Aware Adaptive Tree and Auxiliary Route (2404.11352v1)
Abstract: Distributed machine learning is becoming increasingly popular for geo-distributed data analytics, as it enables collaborative analysis of data scattered across data centers in different regions. This paradigm eliminates the need to centralize sensitive raw data in one location, but it suffers from high parameter synchronization delays, which stem from bandwidth-limited, heterogeneous, and fluctuating wide-area networks. Prior research has focused on optimizing the synchronization topology, evolving from star-like to tree-based structures. However, these solutions typically depend on regular tree structures and lack an adequate topology metric, resulting in limited improvements. This paper proposes NetStorm, an adaptive and highly efficient communication scheduler designed to speed up parameter synchronization across geo-distributed data centers. First, it establishes an effective metric for optimizing a multi-root FAPT synchronization topology. Second, a network awareness module is developed to acquire network knowledge and guide topology decisions. Third, a multipath auxiliary transmission mechanism is introduced to enhance network awareness and enable multipath transmission. Lastly, we design policy consistency protocols to guarantee seamless updates of transmission policies. Empirical results show that NetStorm significantly outperforms distributed training systems such as MXNet, MLNet, and TSEngine, achieving a 6.5–9.2× speedup over MXNet.
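To make the idea of a network-aware synchronization topology concrete, the sketch below grows an aggregation tree over a measured inter-data-center bandwidth matrix with a greedy max-bottleneck (Prim-style) heuristic. This is an illustration only, not NetStorm's FAPT construction: the multi-root layout, the auxiliary multipath routes, and the paper's actual topology metric are not reproduced here, and all data-center names and bandwidth values are hypothetical.

```python
# Illustrative only: a greedy, network-aware aggregation-tree builder.
# NOT NetStorm's FAPT algorithm; the names, bandwidth values, and the
# max-bottleneck heuristic are assumptions made for this sketch.
import heapq

def build_aggregation_tree(bandwidth, root):
    """Grow a tree from `root`, always attaching the data center whose
    best available link has the highest measured bandwidth."""
    nodes = set(bandwidth)                # data-center names
    in_tree = {root}
    parent = {root: None}
    # max-heap keyed on negative bandwidth so heapq pops the widest link
    heap = [(-bw, root, dst) for dst, bw in bandwidth[root].items()]
    heapq.heapify(heap)
    while len(in_tree) < len(nodes) and heap:
        neg_bw, src, dst = heapq.heappop(heap)
        if dst in in_tree:
            continue
        in_tree.add(dst)
        parent[dst] = src                 # dst forwards its updates to src
        for nxt, bw in bandwidth[dst].items():
            if nxt not in in_tree:
                heapq.heappush(heap, (-bw, dst, nxt))
    return parent                         # child -> parent aggregation edges

# Hypothetical WAN bandwidth measurements (Mbps) between four data centers.
bandwidth = {
    "us-east": {"us-west": 80, "eu": 40, "asia": 20},
    "us-west": {"us-east": 80, "eu": 30, "asia": 50},
    "eu":      {"us-east": 40, "us-west": 30, "asia": 25},
    "asia":    {"us-east": 20, "us-west": 50, "eu": 25},
}
print(build_aggregation_tree(bandwidth, root="us-east"))
```

In this toy setting each data center joins the tree through its widest available link, so low-bandwidth WAN paths are kept off the aggregation path; in the paper's setting such decisions would be revisited as the network awareness module observes bandwidth fluctuations.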
- L. Chen, S. Liu, B. Li et al., “Scheduling jobs across geo-distributed datacenters with max-min fairness,” IEEE TNSE, vol. 6, no. 3, pp. 488–500, 2018.
- Z. Niu, B. He, A. C. Zhou et al., “Multi-objective optimizations in geo-distributed data analytics systems,” in ICPADS. IEEE, 2017, pp. 519–528.
- A. C. Zhou, Y. Xiao, Y. Gong et al., “Privacy regulation aware process mapping in geo-distributed cloud data centers,” IEEE TPDS, vol. 30, no. 8, pp. 1872–1888, 2019.
- H. Zhou, W. Cai, Z. Li et al., “TSEngine: Enable efficient communication overlay in distributed machine learning in WANs,” IEEE TNSM, vol. 18, no. 4, pp. 4846–4859, 2021.
- B. Yuan, Y. He, J. Davis et al., “Decentralized training of foundation models in heterogeneous environments,” in NeurIPS, vol. 35, 2022, pp. 25464–25477.
- Y. Ren, X. Wu, L. Zhang et al., “iRDMA: Efficient use of RDMA in distributed deep learning systems,” in HPCC/SmartCity/DSS. IEEE, 2017, pp. 231–238.
- J. Xue, Y. Miao, C. Chen et al., “Fast distributed deep learning over RDMA,” in EuroSys, 2019, pp. 1–14.
- Y. Li, C. Fan, X. Zhang et al., “Placement of parameter server in wide area network topology for geo-distributed machine learning,” JCN, vol. 25, no. 3, pp. 370–380, 2023.
- L. Fan, X. Zhang, Y. Zhao et al., “Online training flow scheduling for geo-distributed machine learning jobs over heterogeneous and dynamic networks,” IEEE TCCN, vol. 10, no. 1, pp. 277–291, 2024.
- C. Fan, X. Zhang, Y. Zhao et al., “Self-adaptive gradient quantization for geo-distributed machine learning over heterogeneous and dynamic networks,” IEEE TCC, vol. 11, no. 4, pp. 3483–3496, 2023.
- H. Mi, K. Xu, D. Feng et al., “Collaborative deep learning across multiple data centers,” Science China Information Sciences, vol. 63, pp. 1–11, 2020.
- I. Cano, M. Weimer, D. Mahajan et al., “Towards geo-distributed machine learning,” arXiv preprint arXiv:1603.09035, pp. 1–10, 2016.
- H. Zhou, Z. Li, H. Yu et al., “NBSync: Parallelism of local computing and global synchronization for fast distributed machine learning in WANs,” IEEE TSC, vol. 16, no. 6, pp. 4115–4127, 2023.
- X. Li, R. Zhou, L. Jiao et al., “Online placement and scaling of geo-distributed machine learning jobs via volume-discounting brokerage,” IEEE TPDS, vol. 31, no. 4, pp. 948–966, 2019.
- X. Lyu, C. Ren, W. Ni et al., “Optimal online data partitioning for geo-distributed machine learning in edge of wireless networks,” IEEE JSAC, vol. 37, no. 10, pp. 2393–2406, 2019.
- M. Li, D. G. Andersen, J. W. Park et al., “Scaling distributed machine learning with the parameter server,” in OSDI, 2014, pp. 583–598.
- K. Hsieh, A. Harlap, N. Vijaykumar et al., “Gaia: Geo-distributed machine learning approaching LAN speeds,” in NSDI, 2017, pp. 629–647.
- Z. Li, H. Zhou, T. Zhou et al., “ESync: Accelerating intra-domain federated learning in heterogeneous data centers,” IEEE TSC, vol. 15, no. 4, pp. 2261–2274, 2020.
- J. Geng, D. Li, Y. Cheng et al., “HiPS: Hierarchical parameter synchronization in large-scale distributed machine learning,” in NetAI, 2018, pp. 1–7.
- L. Mai, C. Hong, and P. Costa, “Optimizing network performance in distributed machine learning,” in HotCloud, 2015, pp. 1–7.
- X. Wan, H. Zhang, H. Wang et al., “RAT: Resilient allreduce tree for distributed machine learning,” in APNET, 2020, pp. 52–57.
- D. Yang, W. Zhang, Q. Ye et al., “DetFed: Dynamic resource scheduling for deterministic federated learning over time-sensitive networks,” IEEE TMC, pp. 1–17, 2023.
- L. Liu, H. Yu, and G. Sun, “Reconfigurable aggregation tree for distributed machine learning in optical WAN,” in ICAML. IEEE, 2021, pp. 206–210.
- A. Sapio, M. Canini, C.-Y. Ho et al., “Scaling distributed machine learning with in-network aggregation,” in NSDI, 2021, pp. 785–808.
- G. Wang, S. Venkataraman, A. Phanishayee et al., “Blink: Fast and generic collectives for distributed ML,” in MLSys, vol. 2, 2020, pp. 172–186.
- Z. Zhang, C. Wu, and Z. Li, “Near-optimal topology-adaptive parameter synchronization in distributed DNN training,” in IEEE INFOCOM. IEEE, 2021, pp. 1–10.
- S. Jeaugey, “Massively scale your deep learning training with NCCL 2.4.” [Online]. Available: https://developer.nvidia.com/blog/massively-scale-deep-learning-training-nccl-2-4/
- J. Huang, P. Majumder, S. Kim et al., “Communication algorithm-architecture co-design for distributed deep learning,” in IEEE/ACM ISCA. IEEE, 2021, pp. 181–194.
- A. Reisizadeh, S. Prakash, R. Pedarsani et al., “CodedReduce: A fast and robust framework for gradient aggregation in distributed learning,” IEEE/ACM TON, vol. 30, no. 1, pp. 148–161, 2021.
- T. Ma, L. Luo, H. Yu et al., “Klonet: An easy-to-use and scalable platform for computer networks education,” in NSDI, 2024, pp. 1–15.
- T. Chen, M. Li, Y. Li et al., “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” in NeurIPS, 2016, pp. 1–6.
- M. Abadi, P. Barham, J. Chen et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, 2016, pp. 265–283.
- L. Luo, P. West, A. Krishnamurthy et al., “PLink: Discovering and exploiting datacenter network locality for efficient cloud-based distributed training,” in MLSys, 2020, pp. 1–16.
- P. Zhou, Q. Lin, D. Loghin et al., “Communication-efficient decentralized machine learning over heterogeneous networks,” in ICDE. IEEE, 2021, pp. 384–395.