
Asynchronous LLM Function Calling (2412.07017v1)

Published 9 Dec 2024 in cs.CL and cs.AI

Abstract: LLMs use function calls to interface with external tools and data sources. However, the current approach to LLM function calling is inherently synchronous, where each call blocks LLM inference, limiting LLM operation and concurrent function execution. In this work, we propose AsyncLM, a system for asynchronous LLM function calling. AsyncLM improves the LLM's operational efficiency by enabling LLMs to generate and execute function calls concurrently. Instead of waiting for each call's completion, AsyncLM introduces an interrupt mechanism to asynchronously notify the LLM in-flight when function calls return. We design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process. We demonstrate that AsyncLM can reduce end-to-end task completion latency by 1.6x-5.4x compared to synchronous function calling on a set of benchmark tasks in the Berkeley Function Calling Leaderboard (BFCL). Furthermore, we discuss how interrupt mechanisms can be extended to enable novel human-LLM or LLM-LLM interactions.

Asynchronous Function Calling in LLMs

The paper "AsyncLM: A Synchronous LLM Function Calling" addresses an inherent inefficiency within the function calling mechanisms of LLMs. According to the authors, the current synchronous form of LLM function calling is resource inefficient because it blocks LLM inference when interfacing with external tools and data sources. This poses a significant limitation, particularly in scenarios requiring concurrent operations or rapid response rates, where each function awaits the completion of its predecessor.

To address these limitations, the authors propose AsyncLM, a system that executes function calls asynchronously. The core objective of AsyncLM is to improve operational efficiency through an interrupt mechanism that notifies the LLM when a function call completes. This allows the LLM to continue generating tokens while function calls execute, overlapping the two to reduce end-to-end latency.
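To make the overlap concrete, here is a minimal sketch of the idea, not the paper's implementation: token generation proceeds while a slow tool call runs in the background, and the result is surfaced mid-generation, mimicking AsyncLM's interrupt notification. All names and timings here are hypothetical.

```python
import asyncio

# Hypothetical sketch of asynchronous function calling (not the paper's code).
# The "LLM" keeps producing tokens while a tool call runs concurrently; when
# the call returns, its result is injected into the stream as an interrupt.

async def slow_tool(query: str) -> str:
    await asyncio.sleep(2.0)          # stand-in for a slow external API
    return f"result({query})"

async def generate_tokens(pending: asyncio.Task) -> None:
    step = 0
    while not pending.done():         # the LLM is NOT blocked on the call
        print(f"token_{step}")        # keep decoding other parts of the task
        step += 1
        await asyncio.sleep(0.5)
    print(f"[interrupt] {pending.result()}")  # result arrives mid-generation

async def main() -> None:
    call = asyncio.create_task(slow_tool("weather in NYC"))  # issue, don't wait
    await generate_tokens(call)

asyncio.run(main())
```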

Mechanism and Design

AsyncLM co-designs asynchronous function calls with the LLM inference process. The system employs a domain-specific language (CML) to denote function calls and interrupts, enabling the LLM to issue calls and respond to returning results within its token stream. The paper also outlines a fine-tuning strategy that adapts LLMs to these interrupt semantics, ensuring the transition from synchronous to asynchronous processing does not compromise the LLM's accuracy or inference capabilities.
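The exact CML syntax is defined in the paper; purely as a rough illustration, suppose calls appear in the output as hypothetical [CALL]...[END] spans and completed results are fed back as [INTR]...[END] spans. A host-side runtime could then scan the decoded stream for calls to dispatch and format interrupts to inject:

```python
import re

# Rough illustration of an in-context call/interrupt protocol.
# The marker tokens [CALL], [INTR], and [END] are hypothetical stand-ins;
# the paper defines its own CML syntax.

CALL_RE = re.compile(r"\[CALL\](.+?)\[END\]", re.DOTALL)

def extract_calls(generated_text: str) -> list[str]:
    """Scan LLM output for function-call spans to dispatch asynchronously."""
    return CALL_RE.findall(generated_text)

def make_interrupt(call_id: str, result: str) -> str:
    """Format a returned result as an interrupt span to inject into the
    LLM's context, so the model can react to it mid-generation."""
    return f"[INTR]{call_id}: {result}[END]"

text = 'Checking... [CALL]get_weather(city="NYC")[END] meanwhile, the flight...'
print(extract_calls(text))                     # ['get_weather(city="NYC")']
print(make_interrupt("call_0", "72F, sunny"))  # span appended to the context
```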

The implementation covers popular LLMs such as Llama 3 and GPT-4o, including in-context prompting techniques for models like GPT-4o where direct fine-tuning is not feasible. The authors also leverage cloud API services, such as OpenAI's streaming API, to demonstrate AsyncLM's latency reductions in practice.
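As a sketch of why streaming makes this feasible in principle: tokens arrive incrementally, so a client can watch for a call marker and dispatch the tool before the response finishes. The system prompt and the [CALL]/[END] convention below are assumptions for illustration, not the paper's exact prompts or protocol.

```python
from openai import OpenAI

# Illustrative sketch only: watch a streamed GPT-4o response for an assumed
# in-context call marker, so a tool could be launched while the rest of the
# response is still being generated.

client = OpenAI()  # requires OPENAI_API_KEY in the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "When you need a tool, emit [CALL]fn(args)[END] and keep going."},
        {"role": "user", "content": "Plan my trip to NYC."},
    ],
    stream=True,  # tokens arrive incrementally instead of all at once
)

buffer = ""
for chunk in stream:
    if not chunk.choices:
        continue
    buffer += chunk.choices[0].delta.content or ""
    if "[CALL]" in buffer and "[END]" in buffer:
        # A complete call span has streamed in; dispatch it in the background
        # (e.g., a thread or task) while continuing to consume tokens.
        span = buffer[buffer.index("[CALL]"):buffer.index("[END]") + len("[END]")]
        print("dispatch:", span)
        buffer = buffer[buffer.index("[END]") + len("[END]"):]
```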

Performance and Results

Performance evaluations on the Berkeley Function Calling Leaderboard (BFCL) show that AsyncLM reduces task completion latency by factors ranging from 1.6× to 5.4× compared to traditional synchronous methods. Notably, the paper argues that asynchronous function calling is theoretically at least as fast as synchronous parallel alternatives. These gains come without degrading function calling accuracy, which remains consistent with the underlying model's capabilities.
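One way to see why such a bound is plausible, using a simplified latency model rather than the paper's exact analysis: let g_i be the time to generate the tokens of call i and f_i its execution time.

```latex
% Simplified, illustrative latency model (not the paper's formulation).
% Synchronous serial calling pays both costs in sequence; synchronous
% parallel calling still blocks generation until the slowest call returns;
% asynchronous calling can dispatch calls just as eagerly while also
% overlapping execution with any remaining token generation.
\[
  T_{\text{sync}} = \sum_i \left( g_i + f_i \right), \qquad
  T_{\text{par}}  = \sum_i g_i + \max_i f_i, \qquad
  T_{\text{async}} \le T_{\text{par}} \le T_{\text{sync}}
\]
```

In this model, equality holds only when no generation remains to overlap with the in-flight calls.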

Theoretical Implications and Future Applications

Theoretically, AsyncLM invites broader considerations in LLM architecture, particularly in how LLMs interact with external functions and APIs. The ability to use asynchronous operations can be extended to enable more complex interactions, such as those between human users and LLMs, or inter-agent communications among multiple LLMs, providing new horizons for LLM application development.

Practically, asynchronous function execution has significant implications for applications demanding real-time processing, such as AI-driven autonomous agents and neurosymbolic systems that blend symbolic reasoning with LLM capabilities. Future research might explore asynchronous function calls in a wider range of LLM-based systems, refining the interrupt mechanism for increasingly sophisticated use cases.

In conclusion, this paper provides a substantive contribution to optimizing LLM operations, showcasing a significant reduction in latency without compromising accuracy and opening new possibilities for AI applications. AsyncLM presents a compelling case for rethinking the traditional synchronous design paradigm, moving towards more efficient, asynchronous interactions that enhance the functionality and responsiveness of LLMs.

Authors (3)
  1. In Gim (3 papers)
  2. Seung-seob Lee (7 papers)
  3. Lin Zhong (46 papers)