- The paper demonstrates that Salus enables rapid job switching using fine-grained GPU sharing, reducing average job completion time by up to 3.19× via SRTF scheduling.
- The paper shows that its GPU lane memory sharing optimizes resource allocation, achieving a 2.38× improvement in GPU utilization during hyper-parameter tuning workloads.
- The paper highlights that Salus allows multiple DL models to serve inference on a single GPU concurrently, delivering a 42× boost in GPU utilization compared to not sharing the GPU.
Salus: Enhancing GPU Efficiency in Deep Learning Applications
The paper presents "Salus," a system designed to significantly improve GPU utilization in deep learning (DL) environments by introducing fine-grained GPU sharing primitives. Traditional GPU management paradigms allocate entire GPUs to single DL jobs, resulting in inefficiencies such as GPU underutilization and head-of-line (HOL) blocking. Salus addresses these issues by facilitating efficient sharing of GPU resources among concurrent DL applications, thereby optimizing both utilization and performance.
Key Contributions and Primitives
Salus's architecture revolves around two primary GPU sharing primitives: fast job switching and GPU lane memory sharing. Together, these primitives enable flexible scheduling policies built on preemption, fair sharing, and job packing, which are crucial in high-performance, multi-tenant environments.
- Fast Job Switching: By differentiating memory allocations into persistent (model and framework-internal) and ephemeral categories, Salus achieves rapid job switching without incurring the substantial overhead associated with traditional context-switching mechanisms. This is particularly beneficial for implementing sophisticated scheduling policies like shortest-remaining-time-first (SRTF) that require preemptive capabilities.
- GPU Lane Memory Sharing: The GPU's ephemeral memory region is subdivided into lanes. Jobs assigned to the same lane execute serially, so a lane only needs to hold the largest ephemeral footprint among its jobs rather than the sum. This abstraction lets multiple DL jobs with distinct ephemeral memory requirements run on one GPU, reclaiming underutilized memory without sacrificing the independence needed for dynamic allocation patterns. A minimal sketch of how the two primitives fit together follows this list.
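To make the interplay between these two primitives concrete, here is a minimal Python sketch. The names (`MemoryTag`, `Job`, `Lane`, `switch_to`) are hypothetical and do not correspond to Salus's actual API; the sketch only assumes what the paper describes: persistent allocations stay resident across switches, ephemeral allocations are confined to a lane, and jobs within a lane run their iterations one at a time.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class MemoryTag(Enum):
    """Allocation categories distinguished at allocation time (hypothetical names)."""
    PERSISTENT = auto()   # model parameters and framework-internal state
    EPHEMERAL = auto()    # per-iteration scratch memory (activations, workspaces)


@dataclass
class Job:
    name: str
    persistent_mb: int    # stays resident on the GPU across switches
    ephemeral_mb: int     # peak scratch footprint of one iteration


@dataclass
class Lane:
    """A slice of the GPU's ephemeral memory region.

    Jobs in the same lane execute their iterations serially, so the lane
    only needs room for the largest ephemeral footprint among its jobs,
    not the sum of all of them.
    """
    capacity_mb: int
    jobs: deque = field(default_factory=deque)
    active: Optional[Job] = None

    def admit(self, job: Job) -> bool:
        if job.ephemeral_mb > self.capacity_mb:
            return False
        self.jobs.append(job)
        return True

    def switch_to(self, job: Job) -> None:
        # Fast switch at an iteration boundary: every job's persistent memory
        # stays on the GPU, so switching only changes which job runs its next
        # iteration; no model weights are paged in or out.
        assert job in self.jobs, "job must be admitted to this lane first"
        self.active = job


# Two training jobs share one 4 GB lane; switching between them never
# touches their persistent (model) memory.
lane = Lane(capacity_mb=4096)
resnet = Job("resnet50", persistent_mb=1200, ephemeral_mb=3500)
lstm = Job("lstm", persistent_mb=300, ephemeral_mb=1800)
assert lane.admit(resnet) and lane.admit(lstm)

lane.switch_to(resnet)
lane.switch_to(lstm)
```

Because a switch merely updates which job runs the next iteration, preemption becomes cheap enough to drive the scheduling policies described below.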
Scheduling Policies and Empirical Results
Salus supports various scheduling policies, highlighting its flexibility and effectiveness:
- SRTF: Leveraging fast job switching, Salus implements preemptive shortest-remaining-time-first scheduling, reducing average job completion time by up to 3.19× relative to FIFO (see the scheduler sketch after this list).
- PACK: This policy aims to maximize GPU utilization by packing several concurrent jobs efficiently. Salus achieves a 2.38× improvement in utilization during hyper-parameter tuning workloads.
- FAIR: Ensures equitable resource distribution among jobs, encouraging fair access and throughput balance.
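As a rough illustration of how SRTF can sit on top of cheap, iteration-granularity switching, the sketch below keeps jobs in a priority queue ordered by an estimate of remaining work. The `TrainingJob` and `SRTFScheduler` names, and the assumption that remaining work is known or can be estimated, are illustrative choices rather than the paper's implementation.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class TrainingJob:
    remaining_iters: int              # estimated work left; ordering key for SRTF
    name: str = field(compare=False)


class SRTFScheduler:
    """Shortest-remaining-time-first on top of iteration-level switching.

    At each iteration boundary the job with the least remaining work runs
    next; because a switch only redirects execution while persistent memory
    stays resident, preemption costs roughly one iteration instead of a
    full model reload.
    """

    def __init__(self) -> None:
        self._queue: list[TrainingJob] = []

    def submit(self, job: TrainingJob) -> None:
        heapq.heappush(self._queue, job)

    def run(self) -> None:
        while self._queue:
            job = heapq.heappop(self._queue)      # shortest remaining work first
            job.remaining_iters -= 1              # run exactly one iteration
            if job.remaining_iters > 0:
                heapq.heappush(self._queue, job)  # re-rank after every iteration
            else:
                print(f"{job.name} finished")


sched = SRTFScheduler()
sched.submit(TrainingJob(5000, "long-vgg-training"))
sched.submit(TrainingJob(200, "short-bert-finetune"))   # completes first under SRTF
sched.run()
```

The key point is that re-ranking happens at every iteration boundary, which is only practical because switching does not reload model state.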
Moreover, Salus delivers substantial gains for DL inference: by letting many models share the same GPU, it improves GPU utilization by 42× compared to not sharing, significantly surpassing what NVIDIA's Multi-Process Service (MPS) can achieve.
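The inference result is easier to see with a back-of-the-envelope packing sketch: each model's resident (persistent) footprint is small, and per-request scratch memory can be served from a single shared lane because requests within a lane execute serially. The greedy first-fit packer below is purely illustrative; the capacities and model sizes are made-up numbers, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class InferenceModel:
    name: str
    persistent_mb: int   # weights kept resident on the GPU
    ephemeral_mb: int    # short-lived per-request scratch memory


def pack_models(models, gpu_mb=16384, lane_mb=2048):
    """Greedy first-fit packing of inference models onto one GPU.

    An illustration, not Salus's actual algorithm: one shared lane covers the
    largest per-request scratch footprint (requests in a lane run one at a
    time), and models are admitted until persistent memory is exhausted.
    """
    admitted, used_mb = [], lane_mb          # reserve the shared scratch lane
    for m in sorted(models, key=lambda m: m.persistent_mb):
        if used_mb + m.persistent_mb <= gpu_mb and m.ephemeral_mb <= lane_mb:
            admitted.append(m.name)
            used_mb += m.persistent_mb
    return admitted, used_mb


models = [InferenceModel(f"model-{i}", persistent_mb=250, ephemeral_mb=400)
          for i in range(60)]
names, used_mb = pack_models(models)
print(f"packed {len(names)} of {len(models)} models; {used_mb} MB of 16384 MB used")
```

With hypothetical 250 MB models, dozens fit on a single 16 GB GPU, which is the intuition behind the order-of-magnitude utilization gain for inference workloads.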
Practical Implications and Future Directions
By consolidating GPU access through Salus, institutions can make better use of their hardware, notably decreasing operational costs for both DL training and inference. The system also gives cluster management tools a foundation for implementing adaptive, workload-aware resource scheduling.
Salus lays the groundwork for further exploration in several areas: lane management algorithms that adapt automatically to diverse workloads, extending the system to operate across multiple GPUs in tandem, and addressing the challenges of distributed GPU setups using technologies like RDMA for further scalability.
In conclusion, Salus presents a compelling solution for the inefficiencies plaguing current DL computational paradigms, offering a robust system that enhances both the operational performance of DL frameworks and the cost-effectiveness of GPU resources. Its pioneering approach paves the way for ongoing research and development in resource-efficient DL application deployment.