Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
97 tokens/sec
GPT-4o
53 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration (2407.13126v1)

Published 18 Jul 2024 in cs.DC

Abstract: Continuous learning (CL) has emerged as one of the most popular deep learning paradigms deployed in modern cloud GPUs. Specifically, CL has the capability to continuously update the model parameters (through model retraining) and use the updated model (if available) to serve overtime arriving inference requests. It is generally beneficial to co-locate the retraining and inference together to enable timely model updates and avoid model transfer overheads. This brings the need for GPU sharing among retraining and inferences. Meanwhile, multiple CL workloads can share the modern GPUs in the cloud, leading to multi-tenancy execution. In this paper, we observe that prior GPU-sharing techniques are not optimized for multi-tenancy CL workloads. Specifically, they do not coherently consider the accuracy of the retraining model and the inference service level objective (SLO) attainment. Moreover, they cannot accommodate the overtime dynamics (e.g., inference arrival intensity) in CL execution. In this paper, we propose MIGRator, a novel GPU reconfiguration runtime that dynamically performs GPU reconfiguration for multi-tenancy CL workloads. MIGRator is based on the recent NVIDIA multi-instance GPU (MIG) to mitigate resource contention and formulates the reconfiguration optimization into Integer Linear Programming (ILP) to dynamically identify, reconfigure, and allocate the GPU instances. MIGRator leverages the "Goodput" metric in the ILP objective function to consider both inference SLO attainment and model accuracy in the reconfiguration exploration. We evaluate MIGRator using representative multi-tenancy CL workloads. The results show our approach outperforms the state-of-the-art GPU sharing techniques (i.e., Ekya, Astraea, and PARIS) by 17\%, 21\%, and 20\%, respectively.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (9)
  1. Tianyu Wang (154 papers)
  2. Sheng Li (219 papers)
  3. Bingyao Li (2 papers)
  4. Yue Dai (30 papers)
  5. Ao Li (46 papers)
  6. Geng Yuan (58 papers)
  7. Yufei Ding (81 papers)
  8. Youtao Zhang (9 papers)
  9. Xulong Tang (23 papers)

Summary

We haven't generated a summary for this paper yet.