
Revisiting Reliability in Large-Scale Machine Learning Research Clusters (2410.21680v2)

Published 29 Oct 2024 in cs.DC and cs.LG

Abstract: Reliability is a fundamental challenge in operating large-scale ML infrastructures, particularly as the scale of ML models and training clusters continues to grow. Despite decades of research on infrastructure failures, the impact of job failures across different scales remains unclear. This paper presents a view of managing two large, multi-tenant ML clusters, providing quantitative analysis, operational experience, and our own perspective in understanding and addressing reliability concerns at scale. Our analysis reveals that while large jobs are most vulnerable to failures, smaller jobs make up the majority of jobs in the clusters and should be incorporated into optimization objectives. We identify key workload properties, compare them across clusters, and demonstrate essential reliability requirements for pushing the boundaries of ML training at scale. We hereby introduce a taxonomy of failures and key reliability metrics, analyze 11 months of data from two state-of-the-art ML environments with 4 million jobs and over 150 million A100 GPU hours. Building on our data, we fit a failure model to project Mean Time to Failure for various GPU scales. We further propose a method to estimate a related metric, Effective Training Time Ratio, as a function of job parameters, and we use this model to gauge the efficacy of potential software mitigations at scale. Our work provides valuable insights and future research directions for improving the reliability of AI supercomputer clusters, emphasizing the need for flexible, workload-agnostic, and reliability-aware infrastructure, system software, and algorithms.

Summary

  • The paper introduces a structured failure taxonomy and a predictive model for mean time to failure (MTTF) that highlights how vulnerable large jobs are in large GPU clusters.
  • It analyzes over 150 million A100 GPU-hours and 4 million jobs to derive reliability metrics, showing that large jobs fail most often while small jobs dominate the job count.
  • The study proposes the Effective Training Time Ratio (ETTR) metric and recommends mitigation strategies such as proactive health checks and adaptive routing to enhance cluster reliability and performance.

Summary of "Revisiting Reliability in Large-Scale Machine Learning Research Clusters"

The paper "Revisiting Reliability in Large-Scale Machine Learning Research Clusters" presents an in-depth investigation into the operational challenges and scalability issues faced by large ML infrastructures. The paper is based on two extensive, multi-tenant clusters, each dedicated to various AI research workloads and comprising thousands of NVIDIA A100 GPUs. This research emphasizes the importance of reliability in ML operations, given the trend towards increasingly larger ML models and training clusters.

Key Findings and Contributions

  1. Failure Taxonomy and Model Development: The authors propose a taxonomy of infrastructure failures that provides a structured understanding of failure types and their likely causes. They also fit a failure model to project the mean time to failure (MTTF) at various GPU scales, showing that MTTF shrinks as GPU counts grow. The results indicate that large jobs are the most susceptible to failures, underscoring the need for robust mitigation strategies (a minimal sketch of this kind of projection appears after this list).
  2. Quantitative Analysis of Failure Data: The paper analyzes over 150 million GPU hours and 4 million jobs executed over an 11-month period to extract failure rates and other reliability metrics. This extensive data collection allows the authors to derive insights into job-level failure characteristics, highlighting that while large jobs are more vulnerable, smaller jobs dominate in frequency and should also be considered in optimization strategies.
  3. ETTR and Cluster Performance: The authors introduce the Effective Training Time Ratio (ETTR) to quantify how efficiently a job's scheduled time is converted into productive training, accounting for interruptions and restart overheads. The metric is useful for evaluating and optimizing the scheduling and reliability of ML training jobs, and it complements existing measures such as Model FLOPS Utilization (MFU); the sketch after this list shows how ETTR can be estimated from job parameters.
  4. Mitigation Strategies: To improve cluster reliability, the paper discusses several mitigations, including robust health checks, adaptive routing around network failures, and proactive identification of "lemon nodes" that repeatedly cause failures (a hypothetical flagging sketch also follows below). Together these measures are intended to keep the infrastructure healthy enough to sustain uninterrupted ML training at scale.
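
To make items 1 and 3 concrete, here is a minimal sketch of how MTTF can be projected and ETTR estimated from job parameters. It assumes independent failures at a constant per-GPU rate and periodic checkpointing that loses, on average, half a checkpoint interval of work plus a fixed restart overhead per failure; the functional form and all numeric values are illustrative assumptions, not the paper's fitted model.

```python
# Minimal sketch (not the paper's exact model): project mean time to failure
# (MTTF) and estimate the Effective Training Time Ratio (ETTR) for a job,
# assuming independent, memoryless failures at a constant per-GPU rate and
# periodic checkpointing. All numbers are illustrative.

def mttf_hours(num_gpus: int, failures_per_gpu_hour: float) -> float:
    """MTTF(N) ~ 1 / (N * lambda) when failures are independent and memoryless."""
    return 1.0 / (num_gpus * failures_per_gpu_hour)

def ettr_estimate(num_gpus: int,
                  failures_per_gpu_hour: float,
                  ckpt_interval_hours: float,
                  ckpt_write_hours: float,
                  restart_hours: float) -> float:
    """Approximate fraction of scheduled wallclock spent on productive training."""
    failures_per_hour = num_gpus * failures_per_gpu_hour
    ckpt_overhead = ckpt_write_hours / ckpt_interval_hours        # time spent writing checkpoints
    lost_per_failure = ckpt_interval_hours / 2.0 + restart_hours  # average rework + restart time
    return max(0.0, 1.0 - ckpt_overhead - failures_per_hour * lost_per_failure)

LAMBDA = 5e-6  # hypothetical failures per GPU-hour

for n in (1024, 4096, 16384):
    print(f"{n:>6} GPUs: MTTF ~ {mttf_hours(n, LAMBDA):6.1f} h, "
          f"ETTR ~ {ettr_estimate(n, LAMBDA, 0.5, 0.02, 0.25):.3f}")
```

Under these assumptions, MTTF falls inversely with GPU count while the ETTR penalty from failures grows linearly with it, which is the trade-off the paper's models are meant to quantify and that software mitigations aim to soften.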

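Item 4 above mentions proactively flagging "lemon nodes" that repeatedly cause job failures. The snippet below is a hypothetical sketch of what such flagging logic could look like, keyed on per-node failure attribution counts; the thresholds, inputs, and function names are assumptions for illustration, not the authors' production pipeline.

```python
# Hypothetical sketch of "lemon node" flagging: quarantine nodes whose attributed
# failure count over a window is both non-trivial and well above the average of
# nodes that failed at all. Thresholds and inputs are illustrative only.
from collections import Counter
from statistics import mean

def find_lemon_nodes(failure_events, min_failures=3, ratio_over_mean=2.0):
    """failure_events: iterable of hostnames, one entry per attributed job failure."""
    counts = Counter(failure_events)
    if not counts:
        return []
    avg = mean(counts.values())  # average among nodes with at least one failure
    return sorted(host for host, c in counts.items()
                  if c >= min_failures and c >= ratio_over_mean * avg)

events = ["node17", "node03", "node17", "node42", "node17", "node17", "node08"]
print(find_lemon_nodes(events))  # -> ['node17'] with these illustrative thresholds
```

In practice, such a check would run alongside the health checks the paper describes, so that suspect hosts are drained before they can interrupt further large jobs.
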
Implications and Future Directions

The findings have important implications for the design and operation of AI supercomputing clusters. As demand for training large-scale ML models grows, infrastructure must support that scale while maintaining high reliability. The proposed proactive monitoring and mitigation techniques can substantially improve the robustness and efficiency of ML clusters, enabling more productive model training.

Looking forward, the research highlights the necessity of integrating reliability-awareness into system software and algorithms to minimize the impact of failures on productive computation. There is also a compelling case for further exploration into scheduling algorithms that can better anticipate and manage failures in a multi-tenant cluster environment. As AI supercomputing infrastructures evolve, building adaptive, failure-tolerant systems will be fundamental to sustaining progress in AI research and development.
