Fast Distributed Inference Serving for Large Language Models (2305.05920v3)

Published 10 May 2023 in cs.LG and cs.DC

Abstract: LLMs power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arriving job to join. Queues with higher priority than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe, and experimental results show that, compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
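
The skip-join admission idea described in the abstract can be illustrated with a short sketch. The code below is a hypothetical, simplified rendering and not the authors' implementation: the queue count, quantum sizes, the cost model `first_iteration_time`, the per-token decode time, and the `SkipJoinMLFQ` class are all illustrative assumptions. It only shows the scheduling shape: a job's known input length determines its first-iteration time, the job joins the highest-priority queue whose quantum covers that time (skipping queues it would immediately be demoted out of), and the scheduler preempts at token granularity, demoting a job once its quantum is spent.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Illustrative constants; a real system would derive these from profiling.
NUM_QUEUES = 4
BASE_QUANTUM = 1.0                       # quantum of the top queue (arbitrary time units)

def quantum(q: int) -> float:
    """Each lower-priority queue doubles the time quantum."""
    return BASE_QUANTUM * (2 ** q)

def first_iteration_time(input_length: int) -> float:
    """Toy cost model: the first iteration processes the whole prompt,
    so its time grows with input length."""
    return 0.01 * input_length

def skip_join_queue(input_length: int) -> int:
    """Join the highest-priority queue whose quantum covers the first
    iteration, skipping queues the job could not finish a token in."""
    t = first_iteration_time(input_length)
    return next((q for q in range(NUM_QUEUES) if quantum(q) >= t), NUM_QUEUES - 1)

@dataclass(order=True)
class Job:
    priority: int                                    # current queue index (0 = highest)
    arrival: int                                     # tie-breaker: FCFS within a queue
    budget: float = field(compare=False)             # time left in the current quantum
    tokens_done: int = field(default=0, compare=False)

class SkipJoinMLFQ:
    """Minimal token-level-preemptive MLFQ scheduler (sketch only)."""
    TOKEN_TIME = 0.05                                # assumed per-token decode time

    def __init__(self) -> None:
        self.heap: list[Job] = []
        self.clock = count()

    def submit(self, input_length: int) -> Job:
        q = skip_join_queue(input_length)
        job = Job(q, next(self.clock), budget=quantum(q))
        heapq.heappush(self.heap, job)
        return job

    def step(self) -> Job | None:
        """Generate one token for the highest-priority job; demote it to the
        next queue (with that queue's larger quantum) once its budget runs out."""
        if not self.heap:
            return None
        job = heapq.heappop(self.heap)
        job.tokens_done += 1
        job.budget -= self.TOKEN_TIME
        if job.budget <= 0 and job.priority < NUM_QUEUES - 1:
            job.priority += 1
            job.budget = quantum(job.priority)
        heapq.heappush(self.heap, job)
        return job
```

The point of the skip-join step in this sketch is that a long-prompt job never enters a queue whose quantum it would exhaust on its very first iteration, which avoids pointless demotions while still letting short jobs preempt long ones at every output token. The GPU/host memory offloading mechanism mentioned in the abstract is orthogonal to this scheduling logic and is not modeled here.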

Authors (9)
  1. Bingyang Wu
  2. Yinmin Zhong
  3. Zili Zhang
  4. Gang Huang
  5. Xuanzhe Liu
  6. Xin Jin
  7. Shengyu Liu
  8. Fangyue Liu
  9. Yuanhang Sun
Citations (61)