
Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution (2406.00059v2)

Published 29 May 2024 in cs.CL, cs.DC, and cs.LG

Abstract: The complexity of LLM serving workloads has substantially increased due to the integration of external tool invocations, such as ChatGPT plugins. In this paper, we identify a new opportunity for efficient LLM serving for requests that trigger tools: tool partial execution alongside LLM decoding. To this end, we design Conveyor, an efficient LLM serving system optimized for handling requests involving external tools. We introduce a novel interface for tool developers to expose partial execution opportunities to the LLM serving system and a request scheduler that facilitates partial tool execution. Our results demonstrate that tool partial execution can reduce request completion latency by up to 38.8%.
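The core idea, as the abstract describes it, is that a tool need not wait for the LLM to finish decoding the full tool call: if the tool exposes a partial-execution interface, the serving system can feed it argument tokens as they are decoded. The sketch below illustrates that overlap with a toy incremental tool; all names (`feed`, `finish`, the `<tool>` delimiters) are illustrative assumptions, not Conveyor's actual interface.

```python
# Hedged sketch: overlap tool execution with LLM decoding by streaming
# tool-call argument tokens into the tool as they arrive.
# The interface and token format are hypothetical, not Conveyor's API.

class PartialQueryTool:
    """Toy tool that can start work on partial input: it incrementally
    splits a search query into terms as text streams in."""

    def __init__(self):
        self.terms = []
        self._buf = ""

    def feed(self, text_chunk):
        # Consume a decoded chunk immediately; complete words are
        # processed now, the trailing fragment waits for more input.
        self._buf += text_chunk
        *done, self._buf = self._buf.split(" ")
        self.terms.extend(t for t in done if t)

    def finish(self):
        # Called once decoding of the tool call completes.
        if self._buf:
            self.terms.append(self._buf)
            self._buf = ""
        return self.terms


def serve(decoded_tokens, tool):
    """Feed tool-call argument tokens to the tool as each one is decoded,
    instead of buffering until the closing delimiter appears."""
    in_call = False
    for tok in decoded_tokens:
        if tok == "<tool>":
            in_call = True
        elif tok == "</tool>":
            return tool.finish()
        elif in_call:
            tool.feed(tok)  # partial execution happens here
    return tool.finish()


stream = ["Answer:", "<tool>", "best ", "gpu ", "kernels", "</tool>"]
print(serve(stream, PartialQueryTool()))  # ['best', 'gpu', 'kernels']
```

In a real serving system the per-token work would run concurrently with GPU decoding (e.g., on a separate thread or process), which is where the latency reduction comes from; this single-threaded sketch only shows the interleaving structure.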

Authors (4)
  1. Yechen Xu
  2. Xinhao Kong
  3. Tingjun Chen
  4. Danyang Zhuo
Citations (2)