LIBRA: Enabling Workload-aware Multi-dimensional Network Topology Optimization for Distributed Training of Large AI Models (2109.11762v2)

Published 24 Sep 2021 in cs.DC and cs.LG

Abstract: As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes at the expense of increased communication overhead due to the exchange of gradients and activations, which becomes the critical bottleneck of the end-to-end training process. In this work, we motivate the design of multi-dimensional networks within machine learning systems as a cost-efficient mechanism to enhance overall network bandwidth. We also identify that optimal bandwidth allocation is pivotal for multi-dimensional networks to ensure efficient resource utilization. We introduce LIBRA, a framework specifically focused on optimizing multi-dimensional fabric architectures. Through case studies, we demonstrate the value of LIBRA, both in architecting optimized fabrics under diverse constraints and in enabling co-optimization opportunities.
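To make the bandwidth-allocation point concrete, the following is a minimal, hypothetical sketch rather than LIBRA's actual formulation (which the abstract does not detail): if each dimension of a multi-dimensional fabric must carry a known traffic volume per training iteration and the fabric has a fixed aggregate bandwidth budget, allocating bandwidth in proportion to traffic equalizes per-dimension transfer time and minimizes the time of the slowest dimension. The function name, traffic volumes, and bandwidth budget below are illustrative assumptions.

    # Illustrative sketch only (not LIBRA's actual method): allocate a fixed
    # aggregate bandwidth budget across network dimensions in proportion to the
    # traffic each dimension carries, which equalizes per-dimension transfer
    # time and minimizes the slowest dimension's time.

    def allocate_bandwidth(traffic_gb, total_bw_gbps):
        """Minimize max_i(traffic_i / bw_i) subject to sum_i(bw_i) = total_bw_gbps.

        The optimum assigns bandwidth proportional to traffic volume.
        """
        total_traffic = sum(traffic_gb)
        return [total_bw_gbps * t / total_traffic for t in traffic_gb]

    if __name__ == "__main__":
        # Hypothetical 3-dimensional fabric: per-iteration traffic (GB) on the
        # scale-up, rack-level, and scale-out dimensions, with a 1000 GB/s
        # aggregate bandwidth budget. All numbers are made up for illustration.
        traffic = [120.0, 60.0, 20.0]
        bw = allocate_bandwidth(traffic, 1000.0)
        times = [t / b for t, b in zip(traffic, bw)]
        print("bandwidth (GB/s):", [round(b, 1) for b in bw])      # [600.0, 300.0, 100.0]
        print("per-dimension time (s):", [round(x, 4) for x in times])  # all equal

Under a mismatched allocation (for example, splitting bandwidth uniformly across dimensions), the most heavily loaded dimension dominates iteration time; equalizing per-dimension time is what lets the aggregate bandwidth of a multi-dimensional fabric be used efficiently, which is the resource-utilization concern the abstract highlights.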
