TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning (2304.05301v3)
Abstract: The surge of artificial intelligence, particularly large language models (LLMs), has driven the rapid development of large-scale machine learning clusters. Executing distributed models on these clusters is often constrained by communication overhead, making efficient utilization of available network resources crucial. As a result, the routing algorithms employed for collective communications (i.e., collective algorithms) play a pivotal role in determining overall performance. Unfortunately, existing collective communication libraries for distributed machine learning are limited to a fixed set of basic collective algorithms. This limitation hinders communication optimization, especially in modern clusters with heterogeneous and asymmetric topologies. Furthermore, manually designing collective algorithms for every combination of network topology and collective pattern requires heavy engineering and validation effort. To address these challenges, this paper presents TACOS, an autonomous synthesizer that automatically generates topology-aware collective algorithms tailored to specific collective patterns and network topologies. TACOS is highly flexible: it synthesizes an All-Reduce algorithm for a heterogeneous 128-NPU system in just 1.08 seconds while achieving up to a 4.27x performance improvement over state-of-the-art synthesizers. TACOS also scales better, with polynomial synthesis time, in contrast to NP-hard formulations that only scale to systems with tens of NPUs; TACOS can synthesize collective algorithms for a 40K-NPU system in just 2.52 hours.
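To make the synthesis problem concrete, the sketch below greedily schedules chunk transfers over an arbitrary directed topology until every NPU holds every chunk (an All-Gather); an All-Reduce can then be composed as a Reduce-Scatter followed by this All-Gather. This is a minimal, illustrative sketch of topology-aware scheduling under an assumed per-chunk link-cost model; it is not TACOS's actual algorithm, and the topology, function names, and costs are hypothetical.

```python
# Illustrative sketch (not TACOS itself): greedily assign chunk transfers to
# links of an arbitrary directed topology until every NPU holds every chunk.
from collections import defaultdict

def synthesize_all_gather(links, num_npus):
    """links: {(src, dst): transfer_time_per_chunk} for each directed link (assumed cost model)."""
    holds = {n: {n} for n in range(num_npus)}        # chunk n starts on NPU n
    arrive = {n: {n: 0.0} for n in range(num_npus)}  # time each NPU obtained each chunk
    busy_until = defaultdict(float)                  # next time each link is free
    schedule = []                                    # (start_time, src, dst, chunk)

    while any(len(holds[n]) < num_npus for n in range(num_npus)):
        # Pick the earliest-finishing transfer that gives a destination a chunk it lacks.
        best = None
        for (src, dst), cost in links.items():
            for chunk in holds[src] - holds[dst]:
                start = max(busy_until[(src, dst)], arrive[src][chunk])
                finish = start + cost
                if best is None or finish < best[0]:
                    best = (finish, start, src, dst, chunk)
        finish, start, src, dst, chunk = best
        busy_until[(src, dst)] = finish
        holds[dst].add(chunk)
        arrive[dst][chunk] = finish
        schedule.append((start, src, dst, chunk))
    return schedule

# Example: a hypothetical 4-NPU bidirectional ring with one faster shortcut link.
links = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0,
         (1, 0): 1.0, (2, 1): 1.0, (3, 2): 1.0, (0, 3): 1.0,
         (0, 2): 0.5, (2, 0): 0.5}
for step in synthesize_all_gather(links, 4):
    print(step)
```

Because the greedy choice is driven only by the link costs of the given topology, the resulting schedule naturally routes more traffic over faster or shortcut links, which is the intuition behind topology-aware synthesis that the paper automates at scale.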