Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning (2109.01611v1)

Published 1 Sep 2021 in cs.DC and cs.AI

Abstract: As machine learning techniques are applied to a widening range of applications, high-throughput ML inference servers have become critical for online services. Such servers pose two challenges: first, they must provide bounded latency for each request to meet consistent service-level objectives (SLOs), and second, they must be able to serve multiple heterogeneous ML models in one system, since certain tasks involve invoking multiple models and consolidating models improves system utilization. To address these two requirements, this paper proposes a new scheduling framework for multi-model ML inference servers. The paper first shows that, under SLO constraints, current GPUs are not fully utilized by ML inference tasks. To maximize the resource efficiency of inference servers, the key mechanism proposed in this paper exploits hardware support for spatial partitioning of GPU resources. The partitioning mechanism creates a new abstraction layer of configurable GPU resources: the scheduler assigns requests to virtual GPUs, called gpu-lets, with the most effective amount of resources. The paper also investigates a remedy for potential interference effects when two ML tasks run concurrently on a GPU. A prototype implementation demonstrates that spatial partitioning improves throughput by 102.6% on average while satisfying SLOs.
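To make the gpu-let idea concrete, below is a minimal sketch of how a virtual-GPU abstraction and SLO-aware assignment might look. It assumes partitioning via NVIDIA MPS, whose `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` environment variable caps the fraction of SMs a client process may occupy; the `GpuLet` class, `pick_gpulet` function, partition sizes, and latency profiles are illustrative assumptions, not the authors' actual API.

```python
import os
import subprocess

class GpuLet:
    """A virtual GPU: a fixed percentage of one physical GPU's SMs.

    Hypothetical abstraction for illustration; the paper's prototype
    partitions GPUs spatially, here modeled with NVIDIA MPS.
    """
    def __init__(self, gpu_id: int, sm_percent: int):
        self.gpu_id = gpu_id
        self.sm_percent = sm_percent

    def launch(self, server_cmd: list[str]) -> subprocess.Popen:
        """Start a model-serving process confined to this gpu-let."""
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = str(self.gpu_id)
        # MPS (Volta and later) limits the SMs this client may use.
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(self.sm_percent)
        return subprocess.Popen(server_cmd, env=env)


def pick_gpulet(gpulets, model, slo_ms, profile):
    """Assign a model to the smallest gpu-let that still meets its SLO.

    `profile[model][sm_percent]` is an offline-profiled latency estimate
    (an assumption; in practice such estimates would also be padded for
    interference between co-located tasks).
    """
    feasible = [g for g in gpulets
                if profile[model][g.sm_percent] <= slo_ms]
    if not feasible:
        return None  # would violate the SLO; reject or re-partition
    # The smallest feasible partition leaves the most GPU for other models.
    return min(feasible, key=lambda g: g.sm_percent)


# Example: split one GPU into 40% and 60% gpu-lets and route two models.
gpulets = [GpuLet(gpu_id=0, sm_percent=40), GpuLet(gpu_id=0, sm_percent=60)]
profile = {"resnet50": {40: 18.0, 60: 12.0},
           "bert":     {40: 45.0, 60: 28.0}}
print(pick_gpulet(gpulets, "resnet50", slo_ms=20.0, profile=profile).sm_percent)  # 40
print(pick_gpulet(gpulets, "bert",     slo_ms=30.0, profile=profile).sm_percent)  # 60
```

The design choice the example highlights is the one the abstract describes: with SLO-bounded requests, a whole GPU is often more than a model needs, so giving each model the smallest partition that still meets its SLO frees the remaining SMs for other models and raises aggregate throughput.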

Authors (6)
  1. Seungbeom Choi (3 papers)
  2. Sunho Lee (1 paper)
  3. Yeonjae Kim (3 papers)
  4. Jongse Park (14 papers)
  5. Youngjin Kwon (12 papers)
  6. Jaehyuk Huh (6 papers)
Citations (19)
