- The paper demonstrates that Salus enables rapid job switching using fine-grained GPU sharing, reducing average job completion time by up to 3.19× via SRTF scheduling.
- The paper shows that its GPU lane memory sharing optimizes resource allocation, achieving a 2.38× improvement in GPU utilization during hyper-parameter tuning workloads.
- The paper highlights that Salus allows multiple DL models to serve inference on a single GPU concurrently, delivering a 42× boost in GPU utilization compared to not sharing the GPU.
Salus: Enhancing GPU Efficiency in Deep Learning Applications
The paper presents "Salus," a system designed to significantly improve GPU utilization in deep learning (DL) environments by introducing fine-grained GPU sharing primitives. Traditional GPU management paradigms allocate entire GPUs to single DL jobs, resulting in inefficiencies such as GPU underutilization and head-of-line (HOL) blocking. Salus addresses these issues by facilitating efficient sharing of GPU resources among concurrent DL applications, thereby optimizing both utilization and performance.
Key Contributions and Primitives
Salus's architecture revolves around two primary GPU sharing primitives: fast job switching and GPU lane memory sharing. Together, these primitives enable flexible scheduling policies built on preemption, fair sharing, and job packing, which are crucial in high-performance, multi-tenant environments.
- Fast Job Switching: By differentiating memory allocations into persistent (model and framework-internal) and ephemeral categories, Salus achieves rapid job switching without incurring the substantial overhead associated with traditional context-switching mechanisms. This is particularly beneficial for implementing sophisticated scheduling policies like shortest-remaining-time-first (SRTF) that require preemptive capabilities.
- GPU Lane Memory Sharing: The GPU's ephemeral memory region is subdivided into lanes. Jobs assigned to the same lane execute serially, so a lane only needs to hold the largest ephemeral footprint among its jobs rather than the sum. This abstraction lets multiple DL jobs with distinct ephemeral memory requirements run on one GPU, reclaiming underutilized memory without sacrificing the independence needed for dynamic allocation patterns. A minimal sketch of how the two primitives fit together follows this list.
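To make the interplay between these two primitives concrete, here is a minimal Python sketch. The names (`MemoryTag`, `Job`, `Lane`, `switch_to`) are hypothetical and do not correspond to Salus's actual API; the sketch only assumes what the paper describes: persistent allocations stay resident across switches, ephemeral allocations are confined to a lane, and jobs within a lane run their iterations one at a time.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class MemoryTag(Enum):
    """Allocation categories distinguished at allocation time (hypothetical names)."""
    PERSISTENT = auto()   # model parameters and framework-internal state
    EPHEMERAL = auto()    # per-iteration scratch memory (activations, workspaces)


@dataclass
class Job:
    name: str
    persistent_mb: int    # stays resident on the GPU across switches
    ephemeral_mb: int     # peak scratch footprint of one iteration


@dataclass
class Lane:
    """A slice of the GPU's ephemeral memory region.

    Jobs in the same lane execute their iterations serially, so the lane
    only needs room for the largest ephemeral footprint among its jobs,
    not the sum of all of them.
    """
    capacity_mb: int
    jobs: deque = field(default_factory=deque)
    active: Optional[Job] = None

    def admit(self, job: Job) -> bool:
        if job.ephemeral_mb > self.capacity_mb:
            return False
        self.jobs.append(job)
        return True

    def switch_to(self, job: Job) -> None:
        # Fast switch at an iteration boundary: every job's persistent memory
        # stays on the GPU, so switching only changes which job runs its next
        # iteration; no model weights are paged in or out.
        assert job in self.jobs, "job must be admitted to this lane first"
        self.active = job


# Two training jobs share one 4 GB lane; switching between them never
# touches their persistent (model) memory.
lane = Lane(capacity_mb=4096)
resnet = Job("resnet50", persistent_mb=1200, ephemeral_mb=3500)
lstm = Job("lstm", persistent_mb=300, ephemeral_mb=1800)
assert lane.admit(resnet) and lane.admit(lstm)

lane.switch_to(resnet)
lane.switch_to(lstm)
```

Because a switch merely updates which job runs the next iteration, preemption becomes cheap enough to drive the scheduling policies described below.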
Scheduling Policies and Empirical Results
Salus supports various scheduling policies, highlighting its flexibility and effectiveness:
- SRTF: Leveraging fast job switching, Salus implements preemptive shortest-remaining-time-first scheduling, reducing average job completion time by up to 3.19× relative to FIFO (see the scheduler sketch after this list).
- PACK: This policy aims to maximize GPU utilization by packing several concurrent jobs efficiently. Salus achieves a 2.38× improvement in utilization during hyper-parameter tuning workloads.
- FAIR: Ensures equitable resource distribution among jobs, encouraging fair access and throughput balance.
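As a rough illustration of how SRTF can sit on top of cheap, iteration-granularity switching, the sketch below keeps jobs in a priority queue ordered by an estimate of remaining work. The `TrainingJob` and `SRTFScheduler` names, and the assumption that remaining work is known or can be estimated, are illustrative choices rather than the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class TrainingJob:
    remaining_iters: int              # estimated work left; ordering key for SRTF
    name: str = field(compare=False)


class SRTFScheduler:
    """Shortest-remaining-time-first on top of iteration-level switching.

    At each iteration boundary the job with the least remaining work runs
    next; because a switch only redirects execution while persistent memory
    stays resident, preemption costs roughly one iteration instead of a
    full model reload.
    """

    def __init__(self) -> None:
        self._queue: list[TrainingJob] = []

    def submit(self, job: TrainingJob) -> None:
        heapq.heappush(self._queue, job)

    def run(self) -> None:
        while self._queue:
            job = heapq.heappop(self._queue)      # shortest remaining work first
            job.remaining_iters -= 1              # run exactly one iteration
            if job.remaining_iters > 0:
                heapq.heappush(self._queue, job)  # re-rank after every iteration
            else:
                print(f"{job.name} finished")


sched = SRTFScheduler()
sched.submit(TrainingJob(5000, "long-vgg-training"))
sched.submit(TrainingJob(200, "short-bert-finetune"))   # completes first under SRTF
sched.run()
```

The key point is that re-ranking happens at every iteration boundary, which is only practical because switching does not reload model state.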
Moreover, Salus delivers substantial gains for DL inference: by letting many models share the same GPU, it improves GPU utilization by 42× compared to not sharing, significantly surpassing what NVIDIA's Multi-Process Service (MPS) can achieve.
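The inference result is easier to see with a back-of-the-envelope packing sketch: each model's resident (persistent) footprint is small, and per-request scratch memory can be served from a single shared lane because requests within a lane execute serially. The greedy first-fit packer below is purely illustrative; the capacities and model sizes are made-up numbers, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class InferenceModel:
    name: str
    persistent_mb: int   # weights kept resident on the GPU
    ephemeral_mb: int    # short-lived per-request scratch memory


def pack_models(models, gpu_mb=16384, lane_mb=2048):
    """Greedy first-fit packing of inference models onto one GPU.

    An illustration, not Salus's actual algorithm: one shared lane covers the
    largest per-request scratch footprint (requests in a lane run one at a
    time), and models are admitted until persistent memory is exhausted.
    """
    admitted, used_mb = [], lane_mb          # reserve the shared scratch lane
    for m in sorted(models, key=lambda m: m.persistent_mb):
        if used_mb + m.persistent_mb <= gpu_mb and m.ephemeral_mb <= lane_mb:
            admitted.append(m.name)
            used_mb += m.persistent_mb
    return admitted, used_mb


models = [InferenceModel(f"model-{i}", persistent_mb=250, ephemeral_mb=400)
          for i in range(60)]
names, used_mb = pack_models(models)
print(f"packed {len(names)} of {len(models)} models; {used_mb} MB of 16384 MB used")
```

With hypothetical 250 MB models, dozens fit on a single 16 GB GPU, which is the intuition behind the order-of-magnitude utilization gain for inference workloads.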
Practical Implications and Future Directions
By consolidating GPU access through Salus, institutions can make better use of their hardware, notably decreasing operational costs for both DL training and inference. The system also gives cluster management tools a foundation for implementing adaptive, workload-aware resource scheduling.
Salus lays the groundwork for further exploration in several areas: lane management algorithms that adapt automatically to diverse workloads, extending the system to operate across multiple GPUs in tandem, and addressing the challenges of distributed GPU setups using technologies like RDMA for further scalability.
In conclusion, Salus presents a compelling solution for the inefficiencies plaguing current DL computational paradigms, offering a robust system that enhances both the operational performance of DL frameworks and the cost-effectiveness of GPU resources. Its pioneering approach paves the way for ongoing research and development in resource-efficient DL application deployment.