
Revisiting Reliability in Large-Scale Machine Learning Research Clusters (2410.21680v2)

Published 29 Oct 2024 in cs.DC and cs.LG

Abstract: Reliability is a fundamental challenge in operating large-scale ML infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective in understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We hereby introduce a taxonomy of failures and key reliability metrics, analyze 11 months of data from two state-of-the-art ML environments with 4 million jobs and over 150 million A100 GPU hours. Building on our data, we fit a failure model to project Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms.

Summary

  • The paper introduces a structured failure taxonomy and a predictive model for mean time to failure (MTTF) that highlights how vulnerable large jobs are in large GPU clusters.
  • It analyzes over 150 million A100 GPU-hours and 4 million jobs to derive reliability metrics, showing that large jobs fail most often while small jobs dominate the job count.
  • The study proposes the Effective Training Time Ratio (ETTR) metric and recommends mitigation strategies such as proactive health checks and adaptive routing to enhance cluster reliability and performance.

Summary of "Revisiting Reliability in Large-Scale Machine Learning Research Clusters"

The paper "Revisiting Reliability in Large-Scale Machine Learning Research Clusters" presents an in-depth investigation into the operational challenges and scalability issues faced by large ML infrastructures. The paper is based on two extensive, multi-tenant clusters, each dedicated to various AI research workloads and comprising thousands of NVIDIA A100 GPUs. This research emphasizes the importance of reliability in ML operations, given the trend towards increasingly larger ML models and training clusters.

Key Findings and Contributions

  1. Failure Taxonomy and Model Development: The authors propose a taxonomy of infrastructure failures that provides a structured understanding of failure types and their likely causes. They also fit a failure model to project the mean time to failure (MTTF) at various GPU scales, showing that MTTF shrinks as GPU counts grow. The results indicate that large jobs are the most susceptible to failures, underscoring the need for robust mitigation strategies (a minimal sketch of this kind of projection appears after this list).
  2. Quantitative Analysis of Failure Data: The paper analyzes over 150 million GPU hours and 4 million jobs executed over an 11-month period to extract failure rates and other reliability metrics. This extensive data collection allows the authors to derive insights into job-level failure characteristics, highlighting that while large jobs are more vulnerable, smaller jobs dominate in frequency and should also be considered in optimization strategies.
  3. ETTR and Cluster Performance: The authors introduce the Effective Training Time Ratio (ETTR) to quantify how efficiently a job's scheduled time is converted into productive training, accounting for interruptions and restart overheads. The metric is useful for evaluating and optimizing the scheduling and reliability of ML training jobs, and it complements existing measures such as Model FLOPS Utilization (MFU); the sketch after this list shows how ETTR can be estimated from job parameters.
  4. Mitigation Strategies: To improve cluster reliability, the paper discusses several mitigations, including robust health checks, adaptive routing around network failures, and proactive identification of "lemon nodes" that repeatedly cause failures (a hypothetical flagging sketch also follows below). Together these measures are intended to keep the infrastructure healthy enough to sustain uninterrupted ML training at scale.
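
To make items 1 and 3 concrete, here is a minimal sketch of how MTTF can be projected and ETTR estimated from job parameters. It assumes independent failures at a constant per-GPU rate and periodic checkpointing that loses, on average, half a checkpoint interval of work plus a fixed restart overhead per failure; the functional form and all numeric values are illustrative assumptions, not the paper's fitted model.

```python
# Minimal sketch (not the paper's exact model): project mean time to failure
# (MTTF) and estimate the Effective Training Time Ratio (ETTR) for a job,
# assuming independent, memoryless failures at a constant per-GPU rate and
# periodic checkpointing. All numbers are illustrative.

def mttf_hours(num_gpus: int, failures_per_gpu_hour: float) -> float:
    """MTTF(N) ~ 1 / (N * lambda) when failures are independent and memoryless."""
    return 1.0 / (num_gpus * failures_per_gpu_hour)

def ettr_estimate(num_gpus: int,
                  failures_per_gpu_hour: float,
                  ckpt_interval_hours: float,
                  ckpt_write_hours: float,
                  restart_hours: float) -> float:
    """Approximate fraction of scheduled wallclock spent on productive training."""
    failures_per_hour = num_gpus * failures_per_gpu_hour
    ckpt_overhead = ckpt_write_hours / ckpt_interval_hours        # time spent writing checkpoints
    lost_per_failure = ckpt_interval_hours / 2.0 + restart_hours  # average rework + restart time
    return max(0.0, 1.0 - ckpt_overhead - failures_per_hour * lost_per_failure)

LAMBDA = 5e-6  # hypothetical failures per GPU-hour

for n in (1024, 4096, 16384):
    print(f"{n:>6} GPUs: MTTF ~ {mttf_hours(n, LAMBDA):6.1f} h, "
          f"ETTR ~ {ettr_estimate(n, LAMBDA, 0.5, 0.02, 0.25):.3f}")
```

Under these assumptions, MTTF falls inversely with GPU count while the ETTR penalty from failures grows linearly with it, which is the trade-off the paper's models are meant to quantify and that software mitigations aim to soften.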

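Item 4 above mentions proactively flagging "lemon nodes" that repeatedly cause job failures. The snippet below is a hypothetical sketch of what such flagging logic could look like, keyed on per-node failure attribution counts; the thresholds, inputs, and function names are assumptions for illustration, not the authors' production pipeline.

```python
# Hypothetical sketch of "lemon node" flagging: quarantine nodes whose attributed
# failure count over a window is both non-trivial and well above the average of
# nodes that failed at all. Thresholds and inputs are illustrative only.
from collections import Counter
from statistics import mean

def find_lemon_nodes(failure_events, min_failures=3, ratio_over_mean=2.0):
    """failure_events: iterable of hostnames, one entry per attributed job failure."""
    counts = Counter(failure_events)
    if not counts:
        return []
    avg = mean(counts.values())  # average among nodes with at least one failure
    return sorted(host for host, c in counts.items()
                  if c >= min_failures and c >= ratio_over_mean * avg)

events = ["node17", "node03", "node17", "node42", "node17", "node17", "node08"]
print(find_lemon_nodes(events))  # -> ['node17'] with these illustrative thresholds
```

In practice, such a check would run alongside the health checks the paper describes, so that suspect hosts are drained before they can interrupt further large jobs.
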
Implications and Future Directions

The findings have important implications for the design and operation of AI supercomputing clusters. As demand for training large-scale ML models grows, infrastructure must support that scale while maintaining high reliability. The proposed proactive monitoring and mitigation techniques can substantially improve the robustness and efficiency of ML clusters, enabling more productive model training.

Looking forward, the research highlights the necessity of integrating reliability-awareness into system software and algorithms to minimize the impact of failures on productive computation. There is also a compelling case for further exploration into scheduling algorithms that can better anticipate and manage failures in a multi-tenant cluster environment. As AI supercomputing infrastructures evolve, building adaptive, failure-tolerant systems will be fundamental to sustaining progress in AI research and development.
