ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (2401.14351v2)

Published 25 Jan 2024 in cs.LG and cs.DC

Abstract: This paper presents ServerlessLLM, a distributed system designed to support low-latency serverless inference for LLMs. By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage, minimizing the need for remote checkpoint downloads and ensuring efficient checkpoint loading. The design of ServerlessLLM features three core contributions: (i) fast multi-tier checkpoint loading, featuring a new loading-optimized checkpoint format and a multi-tier loading system, fully utilizing the bandwidth of complex storage hierarchies on GPU servers; (ii) efficient live migration of LLM inference, which enables newly initiated inferences to capitalize on local checkpoint storage while ensuring minimal user interruption; and (iii) startup-time-optimized model scheduling, which assesses the locality statuses of checkpoints on each server and schedules the model onto servers that minimize the time to start the inference. Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems, reducing latency by 10-200x across various LLM inference workloads.

Overview of ServerlessLLM

ServerlessLLM introduces a locality-enhanced serverless inference system designed specifically for LLMs. It exploits the underutilized storage bandwidth and capacity available on GPU servers, reducing remote checkpoint downloads and shortening checkpoint loading times.

Checkpoint Loading Optimization

At the core of ServerlessLLM's design are a new loading-optimized checkpoint format and a multi-tier checkpoint loading system, which together make far better use of the storage bandwidth available on GPU servers. A loading function bridges LLM libraries and ServerlessLLM's model manager, enabling fast, direct data transfer from storage to GPUs. As a result, ServerlessLLM outperforms existing loaders such as PyTorch and Safetensors by a substantial margin across various LLM workloads.
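
To make the multi-tier idea concrete, the sketch below shows one plausible way such a loader could be organized: it probes a fastest-first list of tiers, streams the checkpoint in large sequential chunks, and promotes remote checkpoints to a faster local tier. The tier paths, chunk size, and function names are illustrative assumptions, not ServerlessLLM's actual interface.

```python
# Illustrative sketch of multi-tier checkpoint loading; paths, chunk size,
# and function names are hypothetical, not ServerlessLLM's actual API.
import os
import shutil

# Tiers ordered fastest-first; the paths are placeholders.
TIERS = [
    ("host_dram", "/dev/shm/ckpt_cache"),   # checkpoint cached in host memory
    ("local_ssd", "/mnt/nvme/ckpt_cache"),  # checkpoint cached on local NVMe
    ("remote",    "/mnt/remote/ckpts"),     # shared remote store (slowest)
]

CHUNK_BYTES = 64 * 1024 * 1024  # large sequential reads to saturate bandwidth


def locate(model_id: str):
    """Return the fastest tier that already holds this checkpoint."""
    for tier, root in TIERS:
        path = os.path.join(root, model_id)
        if os.path.exists(path):
            return tier, path
    raise FileNotFoundError(model_id)


def load_checkpoint(model_id: str) -> int:
    """Stream the checkpoint chunk by chunk; returns bytes read.

    A real loader would copy each chunk into pinned host memory and issue
    asynchronous host-to-GPU transfers so disk reads and GPU copies overlap.
    """
    tier, path = locate(model_id)
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK_BYTES):
            total += len(chunk)          # placeholder for the GPU copy
    if tier == "remote":                 # promote for faster loads next time
        os.makedirs(TIERS[1][1], exist_ok=True)
        shutil.copy(path, os.path.join(TIERS[1][1], model_id))
    return total
```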

Locality-Driven Inference and Live Migration

ServerlessLLM brings live migration to LLM inference in serverless systems, enabling locality-driven server allocation while preserving low latency. Two mechanisms power this migration: token-based migration, which identifies the minimal set of tokens needed to transfer an inference accurately, and a two-stage live migration process that moves an ongoing LLM inference without disrupting the user. This approach lets ServerlessLLM allocate servers dynamically based on locality, achieving lower latency than methods that rely on predicting model inference time or that preempt ongoing inferences.
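
The following sketch illustrates the token-based, two-stage idea in simplified form: the destination rebuilds its KV cache by re-prefilling the tokens generated so far, then a short pause transfers only the tokens produced in the meantime before decoding resumes on the destination. The InferenceState structure and the dest_engine methods (prefill, prefill_delta, resume_decode) are hypothetical interfaces used for illustration.

```python
# Illustrative sketch of token-based, two-stage live migration; the
# InferenceState fields and dest_engine methods are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InferenceState:
    prompt: List[int]                                    # prompt token ids
    generated: List[int] = field(default_factory=list)   # tokens produced so far


def migrate(src_state: InferenceState, dest_engine) -> None:
    """Move an in-flight inference to dest_engine with minimal interruption."""
    # Stage 1: snapshot the tokens known so far and let the destination
    # rebuild its KV cache by re-prefilling them. The source keeps decoding
    # during this stage, so the user notices no pause.
    snapshot_len = len(src_state.generated)
    dest_engine.prefill(src_state.prompt + src_state.generated[:snapshot_len])

    # Stage 2: briefly pause the source and send only the small delta of
    # tokens generated while stage 1 was running, then resume on dest.
    delta = src_state.generated[snapshot_len:]
    dest_engine.prefill_delta(delta)
    dest_engine.resume_decode()
```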

Locality-Aware Server Allocation

ServerlessLLM incorporates estimation models that predict both the time to load a checkpoint from each storage tier and the time to migrate an ongoing LLM inference to another server. Using these estimates, the scheduler places models where local checkpoints can be exploited, evaluating each server's status in the cluster and allocating resources to minimize startup latency.
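
A simplified view of how such estimates could feed the scheduler is sketched below: each candidate server's startup time is approximated from the tier holding the checkpoint and, if the server is busy, from the cost of migrating its ongoing inference, and the server with the smallest estimate wins. The Server fields, bandwidth figures, throughput constant, and the assumption that loading and migration overlap are illustrative, not values from the paper.

```python
# Illustrative sketch of startup-time-optimized scheduling; the Server fields,
# bandwidth numbers, and overlap assumption are illustrative, not measured.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Server:
    name: str
    ckpt_in_dram: bool    # checkpoint already cached in host memory?
    ckpt_on_ssd: bool     # checkpoint already cached on local SSD?
    busy_tokens: int      # tokens of the ongoing inference that would migrate

DRAM_GBPS, SSD_GBPS, NET_GBPS = 20.0, 5.0, 1.0   # example bandwidths (GB/s)
REPREFILL_TOKENS_PER_SEC = 5000.0                # example re-prefill throughput


def load_time(s: Server, ckpt_gb: float) -> float:
    if s.ckpt_in_dram:
        return ckpt_gb / DRAM_GBPS
    if s.ckpt_on_ssd:
        return ckpt_gb / SSD_GBPS
    return ckpt_gb / NET_GBPS            # must fetch from remote storage


def migration_time(s: Server) -> float:
    # Token-based migration: the ongoing inference is re-prefilled elsewhere.
    return s.busy_tokens / REPREFILL_TOKENS_PER_SEC


def estimated_startup(s: Server, ckpt_gb: float) -> float:
    # Assume loading and migration can overlap; a stricter model might sum them.
    return max(load_time(s, ckpt_gb), migration_time(s))


def pick_server(servers: Iterable[Server], ckpt_gb: float) -> Server:
    return min(servers, key=lambda s: estimated_startup(s, ckpt_gb))
```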

Comprehensive Experiments and Results

ServerlessLLM is rigorously evaluated through microbenchmarks and real-world traces. The experiments include comparisons against baseline loaders such as Safetensors and PyTorch, as well as diverse LLM inference workloads run on a GPU cluster. ServerlessLLM delivers a 10-200x latency improvement over state-of-the-art systems, validating its model-loading efficiency, the efficacy of its inference migration, and its server allocation strategy.

ServerlessLLM, with its innovative design and experimentally proven performance advantages, positions itself as a leading solution for sustainable, efficient, and cost-effective LLM inference services, paving the way for more scalable and responsive AI-powered applications.

Authors (7)
  1. Yao Fu (83 papers)
  2. Leyang Xue (16 papers)
  3. Yeqi Huang (4 papers)
  4. Andrei-Octavian Brabete (2 papers)
  5. Dmitrii Ustiugov (6 papers)
  6. Yuvraj Patel (2 papers)
  7. Luo Mai (22 papers)
Citations (5)