Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
This paper presents a comprehensive analysis of the intricacies involved in the utilization and scheduling of multi-tenant GPU clusters dedicated to Deep Neural Network (DNN) training workloads. Its conclusions are drawn from a two-month workload trace of Microsoft's Philly GPU cluster, providing valuable insights that inform the design of next-generation cluster management systems.
Key Findings
Deep learning workloads impose distinct requirements on cluster infrastructure compared to conventional big data workloads. The paper identifies significant challenges in resource allocation and job scheduling due to DNN workloads' dependence on gang scheduling and locality constraints for efficient model training. The analysis highlights three primary issues:
- Gang Scheduling Requirements: DNN frameworks require all of a job's tasks to be scheduled simultaneously, making jobs inelastic. This all-or-nothing constraint often results in resource fragmentation and extended queuing times (a placement sketch after this list illustrates it together with the locality trade-off below).
- Locality Constraints Impact: The enforced locality constraints, essential for efficient data parallelism and inter-GPU communication, significantly influence queuing delays and GPU utilization. The paper shows that relaxing locality constraints can reduce queuing time but simultaneously lower GPU utilization rates due to increased synchronization overheads and resource interference.
- Job Failures: The research indicates an overall job failure rate of approximately 30%, primarily attributed to user-induced errors (misconfigurations, programming errors) and infrastructure-level failures. These failures are costly in terms of GPU time, especially since they often occur late in the job's runtime.
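To make the tension between gang scheduling and locality concrete, the following Python sketch shows an all-or-nothing placement routine that prefers packing a job onto a single node and only spreads it across nodes once it has queued past a threshold. This is a minimal illustration of the general idea, not the Philly scheduler itself; the class names, the try_place function, and the relax_after threshold are assumptions made for the example.

```python
# Minimal sketch (not the Philly scheduler): a gang scheduler that places a
# job only when ALL of its requested GPUs can be allocated at once, preferring
# single-node (locality-preserving) placements and relaxing locality only
# after the job has queued longer than a threshold.
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    gpus_needed: int
    queued_time: float = 0.0  # seconds the job has waited so far


@dataclass
class Node:
    node_id: str
    total_gpus: int
    free_gpus: int = field(init=False)

    def __post_init__(self):
        self.free_gpus = self.total_gpus


def try_place(job: Job, nodes: list[Node], relax_after: float = 600.0):
    """Return a {node_id: gpu_count} placement, or None if the job must wait.

    Gang scheduling: either every requested GPU is allocated now, or none are.
    Locality: prefer a single node; only after `relax_after` seconds of queuing
    do we allow the job to spread across nodes (which shortens queuing delay
    but typically hurts GPU efficiency due to cross-node communication).
    """
    # 1. Locality-preserving attempt: find one node with enough free GPUs.
    for node in nodes:
        if node.free_gpus >= job.gpus_needed:
            node.free_gpus -= job.gpus_needed
            return {node.node_id: job.gpus_needed}

    # 2. Relaxed attempt: spread across nodes, but only if the job has waited
    #    long enough and the cluster has enough free GPUs in total.
    if job.queued_time >= relax_after:
        if sum(n.free_gpus for n in nodes) >= job.gpus_needed:
            placement, remaining = {}, job.gpus_needed
            for node in sorted(nodes, key=lambda n: -n.free_gpus):
                take = min(node.free_gpus, remaining)
                if take:
                    node.free_gpus -= take
                    placement[node.node_id] = take
                    remaining -= take
                if remaining == 0:
                    return placement
    return None  # gang constraint not satisfiable yet: job keeps queuing


if __name__ == "__main__":
    cluster = [Node("n0", 8), Node("n1", 8)]
    print(try_place(Job("a", 8), cluster))    # single-node placement: {'n0': 8}
    print(try_place(Job("b", 12), cluster))   # None: no node has 12 free, job queues

    fresh = [Node("m0", 8), Node("m1", 8)]
    print(try_place(Job("c", 12), fresh))                     # None: locality still enforced
    print(try_place(Job("c", 12, queued_time=700), fresh))    # spreads: {'m0': 8, 'm1': 4}
```

The fragmentation problem falls out of the gang constraint directly: even when enough GPUs are free in aggregate, a job waits until they can all be claimed in one step, and waiting for them on a single node trades extra queuing delay for better intra-job locality.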
Scheduler Design Implications
The paper provides three practical guidelines for improving cluster schedulers for DNN workloads:
- Prioritize Locality: Because DNN training jobs are long-running, waiting for better intra-job locality can pay off in higher GPU efficiency and shorter runtimes, suggesting a deliberate trade-off of longer queuing delays for tighter placement.
- Interference Mitigation: To sustain high GPU utilization, schedulers should mitigate resource contention and inter-job interference by isolating jobs on dedicated nodes when possible, or by supporting job migration to alleviate resource fragmentation.
- Enhanced Failure Diagnostics: Running preliminary checks for programming errors, and scaling jobs up only after a trial run on a small dedicated pool of GPUs, can drastically reduce wasted GPU time. Schedulers could additionally adopt adaptive retry strategies driven by real-time error classification (a sketch of such a policy follows this list).
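As a rough illustration of the failure-diagnostics guideline, the sketch below classifies a failed job from the tail of its log and retries only failures that look infrastructure-related. The error categories, keyword patterns, and retry limit are assumptions made for the example, not values taken from the paper or from Philly's scheduler.

```python
# Illustrative sketch: classify a failed job's exit reason from its log tail
# and re-queue only failures that a retry could plausibly fix.
import re

# Patterns that typically indicate user errors: retrying wastes GPU time.
USER_ERROR_PATTERNS = [
    r"SyntaxError", r"ImportError", r"KeyError",
    r"CUDA out of memory",            # usually a batch-size / model-size misconfiguration
    r"No such file or directory",     # bad input path in the job configuration
]

# Patterns that typically indicate infrastructure faults: a retry on
# different nodes often succeeds.
INFRA_ERROR_PATTERNS = [
    r"NCCL error", r"connection reset", r"ECC error", r"node lost",
]


def classify_failure(log_tail: str) -> str:
    """Return 'user', 'infra', or 'unknown' based on the end of the job log."""
    for pattern in USER_ERROR_PATTERNS:
        if re.search(pattern, log_tail, re.IGNORECASE):
            return "user"
    for pattern in INFRA_ERROR_PATTERNS:
        if re.search(pattern, log_tail, re.IGNORECASE):
            return "infra"
    return "unknown"


def should_retry(log_tail: str, attempts: int, max_retries: int = 2) -> bool:
    """Adaptive retry policy: never retry user errors at full scale;
    retry infrastructure errors up to `max_retries` times."""
    if classify_failure(log_tail) == "user":
        return False  # return the job to the user (ideally after a small-scale canary run)
    return attempts < max_retries


if __name__ == "__main__":
    print(should_retry("RuntimeError: CUDA out of memory. Tried to allocate ...", attempts=0))  # False
    print(should_retry("NCCL error: unhandled system error, remote peer closed", attempts=0))   # True
    print(should_retry("NCCL error: unhandled system error", attempts=2))                       # False
```

The point of the split is that retrying a misconfigured or buggy job at full scale only burns more GPU hours, whereas transient infrastructure faults are exactly the failures a bounded retry policy can absorb.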
Theoretical and Practical Implications
The findings underscore the need for scheduling policies tailored to DNN workloads' distinct requirements, such as locality-aware placement for intra-job communication and efficient GPU utilization in shared environments. They also highlight the potential of real-time diagnostics to reduce failure rates significantly. On the theoretical side, the results motivate further research into preemptive scheduling strategies and advanced resource management techniques that better accommodate the inelastic nature of deep learning jobs.
Future Outlook
Given the increasing relevance of deep learning applications, future developments in AI workload management will likely emphasize automated failure diagnosis, enhanced resource sharing architectures, and dynamic scheduling algorithms that can adapt to varying workload demands. This paper sets the stage for continued exploration of workload-specific enhancements in AI infrastructure, potentially influencing future standards in computational resource management for large-scale DNN training.
The paper not only enhances our understanding of the operational dynamics of multi-tenant GPU clusters but also provides actionable insights for improving the design and efficiency of future machine learning systems. The publicly released workload trace is an invaluable resource for further research and development in AI infrastructure optimization.