- The paper introduces Gavel, a novel scheduler that integrates hardware heterogeneity into optimization-based policies, reducing job completion time by up to 3.5 times.
- It transforms traditional scheduling methods into heterogeneity-aware approaches using a generalized optimization framework to boost throughput and fairness.
- The study demonstrates practical improvements in resource allocation, job performance, and cost efficiency across diverse accelerator clusters.
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
The paper "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads" introduces Gavel, a novel scheduling system designed to improve resource allocation in the GPU and accelerator clusters used for deep learning training. The widespread deployment of diverse accelerators such as GPUs, TPUs, FPGAs, and custom ASICs has introduced significant performance heterogeneity: different deep learning models achieve very different speedups on different hardware. Existing schedulers largely ignore this heterogeneity, resulting in suboptimal resource allocations that hurt fairness, efficiency, and overall cluster throughput.
Highlights and Contributions
- Gavel Scheduler: The core proposal of the paper is Gavel, a scheduler that optimally utilizes heterogeneous clusters by articulating various scheduling policies as optimization problems. This approach allows Gavel to systematically integrate heterogeneity into the scheduling process, providing substantial improvements over existing methods in terms of job completion time and cluster throughput.
- Optimization Framework: Gavel transforms traditional scheduling policies such as least attained service (LAS), FIFO, and makespan minimization into heterogeneity-aware policies. It does so through a generalized optimization framework that models each job's effective throughput across different accelerator types and incorporates constraints to ensure valid allocations.
- Space Sharing and Placement Sensitivity: Gavel introduces policies that leverage space sharing and placement sensitivity to improve resource utilization and job performance. Space sharing lets multiple jobs run concurrently on a single accelerator, while placement-sensitive policies keep the workers of communication-intensive distributed training jobs physically close (for example, on the same server).
- Broad Policy Support: The system demonstrates versatility by supporting single-level policies such as fairness and least attained service, as well as hierarchical policies that manage resources across multiple organizational entities. All of these policies are expressed as optimization problems that can be solved efficiently.
- Performance Improvements: In experiments on a physical cluster and in simulations of larger clusters, Gavel's heterogeneity-aware policies deliver significant improvements across several objectives: average job completion time drops by up to 3.5 times, makespan improves by 2.5 times, and cost falls by 1.4 times compared to heterogeneity-agnostic policies.
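The effective-throughput idea behind the optimization framework can be illustrated with a toy example. The sketch below assumes two jobs with opposite hardware affinities and compares a heterogeneity-agnostic 50/50 time split against a heterogeneity-aware assignment; the job names and throughput numbers are illustrative, not taken from the paper.

```python
# Hypothetical per-accelerator throughputs (steps/sec); rows are jobs,
# columns are accelerator types.  Numbers are made up for illustration.
throughputs = {
    "transformer": {"v100": 100.0, "k80": 20.0},  # much faster on the V100
    "small_cnn":   {"v100": 20.0,  "k80": 30.0},  # relatively better on the K80
}

def effective_throughput(job, allocation):
    # Effective throughput: the time-weighted mix of the rates the job
    # achieves on each accelerator type, weighted by its time fraction there.
    return sum(frac * throughputs[job][acc] for acc, frac in allocation.items())

# Heterogeneity-agnostic split: each job spends half its time on each type.
agnostic = {job: {"v100": 0.5, "k80": 0.5} for job in throughputs}

# Heterogeneity-aware split: each job runs on the type it is best suited to.
aware = {"transformer": {"v100": 1.0, "k80": 0.0},
         "small_cnn":   {"v100": 0.0, "k80": 1.0}}

for job in throughputs:
    print(job,
          effective_throughput(job, agnostic[job]),
          effective_throughput(job, aware[job]))
```

Because the two jobs prefer different accelerators, the aware assignment raises both jobs' effective throughput at once; Gavel's policies search for such allocations by optimizing objectives defined over these effective-throughput terms.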
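The space-sharing decision can likewise be sketched numerically: co-locating two jobs on one accelerator is worthwhile when each keeps more than the fraction of its solo throughput it would get from time slicing. The solo and packed rates below are hypothetical.

```python
# Hypothetical solo throughputs (steps/sec) and co-located ("packed")
# throughputs for two jobs sharing one GPU; all numbers are illustrative.
solo = {"A": 100.0, "B": 50.0}
packed = {"A": 70.0, "B": 40.0}  # each job's rate while sharing the GPU

# Time slicing: each job holds the GPU half the time, so half its solo rate.
time_sliced = {job: 0.5 * rate for job, rate in solo.items()}

# Compare normalized throughput (fraction of the solo rate each job keeps).
for job in solo:
    print(job, time_sliced[job] / solo[job], packed[job] / solo[job])
```

Here both jobs retain more than the 0.5 normalized throughput of time slicing (0.7 and 0.8), so a space-sharing-aware policy would co-locate them; with different packed rates the same comparison would reject the packing.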
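A hierarchical policy's fair-share targets can also be made concrete. The sketch below assumes a two-level hierarchy with weighted entities at the top and equal sharing among each entity's jobs; the entity names and weights are invented, and Gavel computes such allocations by solving an optimization problem rather than this simple static formula, which ignores heterogeneity and idle shares.

```python
# Hypothetical two-level hierarchy: weighted entities at the top level,
# equal sharing among each entity's jobs below it.
entities = {
    "research":   {"weight": 3.0, "jobs": ["r1", "r2"]},
    "production": {"weight": 1.0, "jobs": ["p1"]},
}

total_weight = sum(e["weight"] for e in entities.values())

# Each job's target share of aggregate cluster resources: the entity's
# weight fraction, divided evenly among the entity's jobs.
share = {}
for entity in entities.values():
    per_job = entity["weight"] / total_weight / len(entity["jobs"])
    for job in entity["jobs"]:
        share[job] = per_job

print(share)  # research jobs get 3/8 each; the production job gets 1/4
```

The shares sum to 1, and changing an entity's weight shifts the whole subtree's allocation, which is the behavior the hierarchical policies in the paper guarantee.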
Implications
The implications of this research are substantial for both the theoretical and practical sides of AI and cluster computing. The ability to effectively schedule heterogeneous resources directly impacts the scalability and efficiency of AI workloads, especially in large cloud environments where resource diversity is the norm. By formalizing policies as optimization problems, Gavel provides a foundation upon which more complex and adaptive scheduling algorithms can be built.
Speculations on Future Developments
Looking ahead, the concepts introduced in Gavel could drive new innovations in resource management systems, particularly in enhancing adaptability and efficiency in multi-tenant cloud solutions. Emerging developments might focus on further refining these optimization models to accommodate dynamically changing workloads or integrate with emerging hardware platforms. Research might also explore extending the framework to support additional policy goals such as energy efficiency and sustainability, aligning resource allocation with broader organizational objectives.
In summary, Gavel addresses critical gaps in existing technologies by embracing the heterogeneity inherent to modern clusters and advancing the effectiveness of deep learning workload scheduling through a principled, optimization-based approach. This work not only delivers immediate improvements in resource utilization but also sets a precedent for future explorations in scalable AI deployment and infrastructure management.