
MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving (2404.02015v2)

Published 2 Apr 2024 in cs.DC

Abstract: LLMs have demonstrated remarkable performance, and organizations are racing to serve LLMs of varying sizes as endpoints for use-cases like chat, programming and search. However, efficiently serving multiple LLMs poses significant challenges for existing approaches due to the varying popularity of LLMs. In this paper, we present MuxServe, a flexible spatial-temporal multiplexing system for efficient multiple LLM serving. The key insight behind MuxServe is to colocate LLMs considering their popularity to multiplex memory resources, and to leverage the characteristics of the prefill and decoding phases to separate and flexibly colocate them to multiplex computation resources. MuxServe formally formulates the multiplexing problem, and proposes a novel placement algorithm and adaptive batch scheduling strategy to identify optimal colocations and maximize utilization. MuxServe designs a unified resource manager to enable flexible and efficient multiplexing. Evaluation results show that MuxServe achieves up to $1.8\times$ higher throughput or processes $2.9\times$ more requests within $99\%$ SLO attainment. The code is available at: \url{https://github.com/hao-ai-lab/MuxServe}.
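To make the popularity-based colocation idea concrete, here is a minimal hypothetical sketch (not the paper's actual placement algorithm): models are greedily assigned to GPU groups so that each group's aggregate request popularity stays balanced, letting popular and long-tail LLMs share the same memory pool. The function name and data shapes are illustrative assumptions.

```python
# Hypothetical sketch of popularity-aware LLM colocation. This is NOT the
# placement algorithm from the MuxServe paper, only a toy greedy heuristic
# illustrating the core idea of balancing popularity across GPU groups.

def colocate_by_popularity(models, num_groups):
    """models: list of (name, popularity) pairs; returns one model list per group."""
    groups = [{"load": 0.0, "models": []} for _ in range(num_groups)]
    # Place models from most to least popular, always into the currently
    # least-loaded group (greedy load balancing).
    for name, popularity in sorted(models, key=lambda m: -m[1]):
        target = min(groups, key=lambda g: g["load"])
        target["models"].append(name)
        target["load"] += popularity
    return [g["models"] for g in groups]

# Example: two popular and two long-tail models across two GPU groups.
assignment = colocate_by_popularity(
    [("llm-a", 0.5), ("llm-b", 0.3), ("llm-c", 0.15), ("llm-d", 0.05)], 2)
print(assignment)  # → [['llm-a'], ['llm-b', 'llm-c', 'llm-d']]
```

The real system additionally multiplexes computation by separating prefill and decoding phases, which this memory-side sketch does not model.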

Authors (8)
  1. Jiangfei Duan
  2. Runyu Lu
  3. Haojie Duanmu
  4. Xiuhong Li
  5. Xingcheng Zhang
  6. Dahua Lin
  7. Ion Stoica
  8. Hao Zhang
Citations (2)