GPU Cluster Scheduling for Network-Sensitive Deep Learning (2401.16492v1)

Published 29 Jan 2024 in cs.PF, cs.DC, and cs.LG

Abstract: We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide improvement of up to 69% in end-to-end Makespan for training all jobs compared to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and minimizing the communication overheads by up to 98% under congested networking conditions.

Authors (6)
  1. Aakash Sharma
  2. Vivek M. Bhasi
  3. Sonali Singh
  4. George Kesidis
  5. Mahmut T. Kandemir
  6. Chita R. Das
Citations (1)

Summary

  • The paper introduces Dally, a scheduler that minimizes communication overhead by deferring job placement until optimal GPU proximity is achieved.
  • It integrates a network-sensitive preemption mechanism and an auto-tuner that adjusts delay timers based on real-time network metrics.
  • Experiments demonstrate that Dally reduces training makespan by up to 69% and average job completion times by up to 83% under congested conditions.

GPU Cluster Scheduling for Network-Sensitive Deep Learning

The paper entitled "GPU Cluster Scheduling for Network-Sensitive Deep Learning" addresses the challenge of efficiently scheduling distributed deep learning (DDL) jobs in GPU clusters by considering the communication overheads inherent in training large-scale deep neural networks (DNNs). The authors introduce "Dally," a GPU-cluster scheduler designed to minimize these communication delays by making intelligent scheduling decisions based on network topology and the network sensitivity of jobs.

Key Contributions

The primary contribution of this research is the development of a novel GPU-cluster scheduler, Dally, incorporating three integral components:

  1. Delay Scheduling Mechanism: The scheduler employs a classical delay scheduling algorithm that defers job placement in anticipation of consolidated GPU resources, thereby minimizing communication overhead. This contrasts with traditional schedulers, which are largely agnostic to GPU proximity.
  2. Network-Sensitive Job Preemption: A network-sensitive preemption strategy gives jobs that are highly sensitive to communication latency priority in GPU placement decisions, further reducing variation in training time caused by inter-GPU communication costs.
  3. Auto-Tuner Mechanism: A mechanism that automatically optimizes the delay timers used by the scheduling algorithm. It adjusts the wait times for resource allocation based on observed network usage patterns and job requirements, improving the scheduler's adaptability to varying network conditions. A simplified sketch of how the three components fit together is given after this list.
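
To make the interplay of these components concrete, below is a minimal Python sketch of delay scheduling with network-sensitive priority and a toy auto-tuned delay timer. It is an illustration under simplifying assumptions, not the paper's implementation: all names (Job, schedule_round, autotune_delay_timer, the congestion thresholds) are hypothetical, and real cluster state, preemption, and network measurement are abstracted away.

```python
from dataclasses import dataclass

@dataclass
class Job:
    job_id: int
    gpus_needed: int
    net_sensitivity: float   # higher = more sensitive to communication delay
    waited_rounds: int = 0   # rounds already spent waiting for a consolidated slot

def schedule_round(queue, free_gpus_per_node, delay_timer):
    """One scheduling round: delay scheduling with network-sensitive priority.

    queue: list of pending Jobs (modified in place as jobs are placed).
    free_gpus_per_node: dict node-name -> free GPU count (modified in place).
    Returns a list of (job, [nodes]) placement decisions made this round.
    """
    placements = []
    # Network-sensitive jobs are considered first, so they get first pick
    # of consolidated (single-node) slots.
    for job in sorted(list(queue), key=lambda j: -j.net_sensitivity):
        # Preferred outcome: all of the job's GPUs on a single node.
        node = next((n for n, g in free_gpus_per_node.items()
                     if g >= job.gpus_needed), None)
        if node is not None:
            free_gpus_per_node[node] -= job.gpus_needed
            placements.append((job, [node]))
            queue.remove(job)
        elif (job.waited_rounds >= delay_timer
              and sum(free_gpus_per_node.values()) >= job.gpus_needed):
            # Delay timer expired: accept a placement spread across nodes.
            needed, used = job.gpus_needed, []
            for n in sorted(free_gpus_per_node,
                            key=free_gpus_per_node.get, reverse=True):
                take = min(needed, free_gpus_per_node[n])
                if take > 0:
                    free_gpus_per_node[n] -= take
                    used.append(n)
                    needed -= take
                if needed == 0:
                    break
            placements.append((job, used))
            queue.remove(job)
        else:
            job.waited_rounds += 1   # keep waiting for a consolidated slot
    return placements

def autotune_delay_timer(current, congestion, lo=1, hi=32):
    """Toy auto-tuner: wait longer for consolidation when the network is
    congested (consolidation pays off more), shorten the timer when idle."""
    if congestion > 0.7:
        return min(hi, current * 2)
    if congestion < 0.3:
        return max(lo, current // 2)
    return current

# Example: two 8-GPU nodes, a network-sensitive 8-GPU job and a 4-GPU job.
queue = [Job(1, 8, net_sensitivity=0.9), Job(2, 4, net_sensitivity=0.2)]
print(schedule_round(queue, {"node-a": 8, "node-b": 8}, delay_timer=4))
```

The design point mirrored here is that a job prefers to wait for a consolidated slot and only falls back to a spread-out placement once its delay timer, itself adjusted to observed congestion, expires.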

Experimental Evaluation

The authors present a comprehensive evaluation of Dally using ArtISt-sim, a specially developed high-fidelity DDL cluster simulator. By simulating various workloads with real-world traces, the researchers demonstrate that Dally can achieve substantial reductions in model training makespan and average job completion times compared to prevailing methods. Specifically:
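
As a point of reference for how these metrics are typically derived from a simulated trace, here is a small, hypothetical sketch of the bookkeeping involved; the record fields ("submit", "finish") and the function name are illustrative and not taken from the simulator described in the paper.

```python
def evaluate_trace(records):
    """Compute end-to-end makespan and average job completion time (JCT)
    from simulated per-job records.  Each record is a dict with hypothetical
    fields 'submit' and 'finish' (timestamps in seconds)."""
    makespan = max(r["finish"] for r in records) - min(r["submit"] for r in records)
    avg_jct = sum(r["finish"] - r["submit"] for r in records) / len(records)
    return makespan, avg_jct

# Toy three-job trace (times in seconds).
trace = [
    {"submit": 0,   "finish": 3600},
    {"submit": 60,  "finish": 5400},
    {"submit": 120, "finish": 2000},
]
makespan, avg_jct = evaluate_trace(trace)
print(f"makespan = {makespan} s, average JCT = {avg_jct:.0f} s")
```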

  • Dally improves the end-to-end makespan by up to 69% compared to traditional schedulers like Tiresias that employ consolidation strategies based on DNN model skew.
  • It reduces the average job completion time by up to 83%.
  • Notably, the communication overhead is minimized by up to 98% in congested networking conditions.

Theoretical and Practical Implications

The work has significant theoretical and practical implications for AI and machine learning:

  • Theoretical Insights: This paper provides insights into how communication delays in DDL can be mitigated through strategic scheduling based on network awareness.
  • Practical Applications: By reducing communication overhead and improving resource allocation efficiency, organizations leveraging shared GPU clusters in both public and private clouds can significantly cut down on infrastructure costs.

Furthermore, this research highlights the importance of evolving traditional scheduling algorithms to adapt to modern high-performance networking technologies. Future research could explore more sophisticated machine learning-based prediction models for resource availability and integrate these into scheduling decisions.

Conclusion

In conclusion, the paper effectively showcases the potential impact of network-aware GPU scheduling on the efficiency of DNN model training. By jointly addressing job placement and communication overhead, Dally represents a step forward in optimizing resource usage within multi-tenant GPU clusters. Future directions might include exploring heterogeneous cluster environments and extending the high-fidelity simulation platform to further refine scheduling heuristics.
