
Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution (2406.00059v2)

Published 29 May 2024 in cs.CL, cs.DC, and cs.LG

Abstract: The complexity of LLM serving workloads has substantially increased due to the integration of external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can reduce request completion latency by up to 38.8%.
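The core idea, as the abstract describes it, is that a tool need not wait for the LLM to finish decoding the full tool call: if the tool exposes a partial-execution interface, the serving system can feed it argument tokens as they are decoded. The sketch below illustrates that overlap with a toy incremental tool; all names (`feed`, `finish`, the `<tool>` delimiters) are illustrative assumptions, not Conveyor's actual interface.

```python
# Hedged sketch: overlap tool execution with LLM decoding by streaming
# tool-call argument tokens into the tool as they arrive.
# The interface and token format are hypothetical, not Conveyor's API.

class PartialQueryTool:
    """Toy tool that can start work on partial input: it incrementally
    splits a search query into terms as text streams in."""

    def __init__(self):
        self.terms = []
        self._buf = ""

    def feed(self, text_chunk):
        # Consume a decoded chunk immediately; complete words are
        # processed now, the trailing fragment waits for more input.
        self._buf += text_chunk
        *done, self._buf = self._buf.split(" ")
        self.terms.extend(t for t in done if t)

    def finish(self):
        # Called once decoding of the tool call completes.
        if self._buf:
            self.terms.append(self._buf)
            self._buf = ""
        return self.terms


def serve(decoded_tokens, tool):
    """Feed tool-call argument tokens to the tool as each one is decoded,
    instead of buffering until the closing delimiter appears."""
    in_call = False
    for tok in decoded_tokens:
        if tok == "<tool>":
            in_call = True
        elif tok == "</tool>":
            return tool.finish()
        elif in_call:
            tool.feed(tok)  # partial execution happens here
    return tool.finish()


stream = ["Answer:", "<tool>", "best ", "gpu ", "kernels", "</tool>"]
print(serve(stream, PartialQueryTool()))  # ['best', 'gpu', 'kernels']
```

In a real serving system the per-token work would run concurrently with GPU decoding (e.g., on a separate thread or process), which is where the latency reduction comes from; this single-threaded sketch only shows the interleaving structure.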

Authors (4)
  1. Yechen Xu
  2. Xinhao Kong
  3. Tingjun Chen
  4. Danyang Zhuo
Citations (2)