Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs (2406.01566v1)

Published 3 Jun 2024 in cs.DC, cs.CL, and cs.LG

Abstract: This paper introduces Helix, a distributed system for high-throughput, low-latency LLM serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7× and reduces prompting and decoding latency by up to 2.8× and 1.3×, respectively, compared to the best existing approaches.
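To make the max-flow formulation concrete, below is a minimal sketch of the underlying primitive: a cluster modeled as a directed graph whose edge capacities bound per-link throughput, solved with the textbook Edmonds-Karp algorithm. The GPU names and capacity numbers are hypothetical illustrations, and this shows only the flow computation, not Helix's actual MILP that jointly optimizes model placement and request scheduling.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow: repeatedly augment along shortest paths found by BFS."""
    # Build a residual graph, making sure every edge has a reverse counterpart.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u in list(residual):
        for v in list(residual[u]):
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS from source to sink through edges with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left: flow is maximal
        # Find the bottleneck capacity along the augmenting path.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path; credit reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Hypothetical 4-GPU heterogeneous cluster (capacities in requests/sec):
# two fast entry GPUs feed two slower GPUs holding later pipeline stages,
# over network links of differing bandwidth.
capacity = {
    "src":    {"A100_0": 50, "A100_1": 50},
    "A100_0": {"T4_0": 30, "T4_1": 10},
    "A100_1": {"T4_0": 10, "T4_1": 30},
    "T4_0":   {"sink": 35},
    "T4_1":   {"sink": 35},
    "sink":   {},
}
print(max_flow(capacity, "src", "sink"))  # → 70 (aggregate serving throughput)
```

The max-flow value gives an upper bound on end-to-end serving throughput; Helix's MILP then searches for model placements whose induced flow graph maximizes this bound while respecting GPU memory and latency constraints.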

Authors (6)
  1. Yixuan Mei (2 papers)
  2. Yonghao Zhuang (10 papers)
  3. Xupeng Miao (37 papers)
  4. Juncheng Yang (12 papers)
  5. Zhihao Jia (43 papers)
  6. Rashmi Vinayak (2 papers)
Citations (3)