ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities (2408.04682v1)

Published 8 Aug 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Recent LLMs advancements sparked a growing research interest in tool assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on either evaluating over stateless web services (RESTful API), based on a single turn user prompt, or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox

The paper introduces ToolSandbox, an evaluation benchmark designed to assess LLM tool-use capabilities in a stateful, conversational, and interactive manner. Recent advances in LLMs have fueled interest in using them as autonomous agents that solve real-world challenges via tool use. ToolSandbox addresses several limitations of existing benchmarks by combining stateful tool execution, implicit state dependencies between tools, a built-in user simulator for on-policy conversational evaluation, and a dynamic evaluation strategy that scores intermediate and final milestones over an arbitrary trajectory.

Key Contributions

ToolSandbox improves on previous benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), ToolEval, and API-Bank in three ways:

Stateful Tool Execution:
- Existing benchmarks often rely on stateless tools interacting with static environments via RESTful APIs. ToolSandbox, however, includes tools that can alter the world state. This introduces implicit state dependencies between tools: for example, cellular service must be enabled before a message can be sent.
Conversational Evaluation:
- Automated conversational evaluation is enabled through an LLM-based user simulator, allowing realistic, on-policy dialogs. This is a departure from the single-turn or off-policy dialog trajectories in existing benchmarks.
Interactive Evaluation:
- Real-world scenarios often involve unexpected errors and follow-ups. ToolSandbox's interactive framework can track the intricate interactions between the user, agent, and environment. Important milestones and minefields are predefined to capture key events for fine-grained evaluation.
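
The implicit state dependency described above can be illustrated with a minimal sketch. All names here (`WorldState`, `set_cellular_service`, `send_message`) are hypothetical and chosen for illustration; they are not taken from the ToolSandbox codebase.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical mutable world state shared by all tools."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular_service(state: WorldState, enabled: bool) -> None:
    """Stateful tool: it mutates the world state instead of returning data."""
    state.cellular_enabled = enabled


def send_message(state: WorldState, phone_number: str, content: str) -> str:
    """Tool with an implicit dependency: it only succeeds when cellular is on."""
    if not state.cellular_enabled:
        raise ConnectionError("Cellular service is not enabled")
    state.sent_messages.append((phone_number, content))
    return "Message sent"


state = WorldState()

# Calling the dependent tool first fails, because nothing in its
# signature tells the agent that cellular service is a prerequisite.
try:
    send_message(state, "+15551234567", "On my way")
except ConnectionError as err:
    first_attempt = f"error: {err}"

# After the state-changing tool runs, the same call succeeds.
set_cellular_service(state, True)
second_attempt = send_message(state, "+15551234567", "On my way")
```

In this sketch the dependency is visible only through the error message, so an agent has to read the failure, infer the missing precondition, and retry in the right order.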

Evaluation Framework

Execution Context

The core of ToolSandbox is a Python-native LLM testing environment. It maintains an Execution Context made up of a world state (a set of mutable databases) and a Message Bus (the dialog history).
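
One way to picture such an Execution Context is sketched below. The class and field names are assumptions for illustration, and the per-message snapshot list is an added assumption meant to show how an evaluator could later inspect the state at each step; none of it is the actual ToolSandbox API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    recipient: str
    content: str


@dataclass
class ExecutionContext:
    """Hypothetical container pairing a world state with a message bus."""
    world_state: dict = field(default_factory=dict)  # mutable "databases"
    message_bus: list = field(default_factory=list)  # dialog history
    snapshots: list = field(default_factory=list)    # assumed: state per message

    def post(self, sender: str, recipient: str, content: str) -> None:
        """Append a message and record a copy of the world state at that point."""
        self.message_bus.append(Message(sender, recipient, content))
        self.snapshots.append(copy.deepcopy(self.world_state))


ctx = ExecutionContext(world_state={"settings": {"cellular": False}})
ctx.post("user", "agent", "Text Alex that I'm running late")

# A tool call mutates the world state before the next message is posted.
ctx.world_state["settings"]["cellular"] = True
ctx.post("agent", "execution_environment", "set_cellular_service(True)")
```

Deep-copying the state on every message is simple but costly for large databases; it is used here only to keep the sketch short.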

Tools and Roles

Tools: Highly composable Python functions that manipulate the world state or query external APIs. The tool set can be augmented to stress different aspects of LLM performance, for example by adding distraction tools or by scrambling tool names and descriptions.
Roles: There are three primary roles: User, Agent, and Execution Environment. The Execution Environment executes Python snippets representing tool calls. The User role is played by an LLM-simulated user that is driven by carefully designed prompts.
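
The name-scrambling augmentation and the Execution Environment's role can be sketched together as follows. The tools, the aliasing scheme, and `execute_tool_call` are all hypothetical; the paper's own augmentation and execution logic may differ.

```python
def get_current_location() -> str:
    """Return the device's current city."""
    return "Cupertino"


def search_weather(city: str) -> str:
    """Return a short weather report for a city."""
    return f"Sunny in {city}"


def scramble_tool_names(tools: dict) -> dict:
    """Replace informative tool names with opaque aliases (assumed scheme).

    The functions and their docstrings are unchanged, so an agent has to
    rely on descriptions rather than names to pick the right tool.
    """
    return {f"tool_{i}": func for i, (_, func) in enumerate(sorted(tools.items()))}


def execute_tool_call(tools: dict, name: str, **kwargs):
    """Minimal stand-in for an execution environment: look up and run a tool."""
    if name not in tools:
        return f"error: unknown tool {name!r}"
    return tools[name](**kwargs)


tools = {
    "get_current_location": get_current_location,
    "search_weather": search_weather,
}
scrambled = scramble_tool_names(tools)

# The original name no longer resolves; only the alias does.
by_old_name = execute_tool_call(scrambled, "search_weather", city="Cupertino")
by_alias = execute_tool_call(scrambled, "tool_1", city="Cupertino")
```

Returning the error as a string rather than raising mirrors the idea that failures are fed back to the agent as part of the conversation, which is an assumption of this sketch.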

Evaluation Metrics

The evaluation methodology incorporates milestones and minefields:

Milestones are critical steps that must occur to complete a task, providing rich intermediate and final execution signals.
Minefields are events that must not occur, used to test scenarios where a task becomes unsolvable with the given tools.
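
A deliberately simplified scorer shows how milestones and minefields could combine. It assumes milestones are ordered predicates over tool-call events and that hitting any minefield zeroes the score; the paper's evaluation strategy is more elaborate, so treat this as an illustration of the idea rather than the benchmark's metric.

```python
from typing import Callable, Dict, List

Event = Dict[str, object]
Predicate = Callable[[Event], bool]


def score_trajectory(
    trajectory: List[Event],
    milestones: List[Predicate],
    minefields: List[Predicate],
) -> float:
    """Return the fraction of milestones matched in order, or 0.0 on a minefield."""
    # Any event that triggers a minefield invalidates the whole trajectory.
    if any(mine(event) for event in trajectory for mine in minefields):
        return 0.0
    if not milestones:
        return 1.0
    matched = 0
    for event in trajectory:
        # Milestones must be reached in the listed order.
        if matched < len(milestones) and milestones[matched](event):
            matched += 1
    return matched / len(milestones)


milestones = [
    lambda e: e["tool"] == "set_cellular_service" and e["args"].get("enabled") is True,
    lambda e: e["tool"] == "send_message",
]
minefields = [lambda e: e["tool"] == "delete_contact"]

full = [
    {"tool": "set_cellular_service", "args": {"enabled": True}},
    {"tool": "send_message", "args": {"content": "On my way"}},
]
partial = full[:1]
unsafe = full + [{"tool": "delete_contact", "args": {}}]

full_score = score_trajectory(full, milestones, minefields)
partial_score = score_trajectory(partial, milestones, minefields)
unsafe_score = score_trajectory(unsafe, milestones, minefields)
```

The partial trajectory still earns credit for the first milestone, which is what makes intermediate signals useful compared with a pass/fail check on the final state alone.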

Experimental Results

Evaluating a range of open-source and proprietary models on ToolSandbox surfaced several findings:

Open Source versus Proprietary Models:

There is a substantial performance gap between proprietary models like GPT-4o and open-source models such as Hermes-2-Pro. Proprietary models consistently score higher in various categories, including state dependency and canonicalization.

State Dependency and Efficiency:

Larger models like GPT-4 and Claude-3-Opus were prone to issuing erroneous parallel tool calls in state-dependent scenarios, which reduced their efficiency.

Canonicalization Challenges:

Time-related arguments and ambiguous entities present significant challenges across models. Misinterpretations and hallucinations of timestamps were noted as common issues.
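
To make the time canonicalization problem concrete, the sketch below converts a relative expression such as "tomorrow at 5 PM" into a POSIX timestamp. The helper name and the choice of a POSIX timestamp as the canonical form are assumptions for illustration. The key point is that the reference time is passed in explicitly (for instance, obtained from a tool) rather than guessed.

```python
from datetime import datetime, timedelta, timezone


def canonicalize_tomorrow_at(now: datetime, hour: int, minute: int = 0) -> float:
    """Turn "tomorrow at HH:MM" into a POSIX timestamp relative to `now`.

    `now` must be timezone-aware so the result does not depend on the
    machine's local timezone.
    """
    target = (now + timedelta(days=1)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    return target.timestamp()


# Reference time supplied explicitly, here the paper's publication date.
now = datetime(2024, 8, 8, 9, 30, tzinfo=timezone.utc)
timestamp = canonicalize_tomorrow_at(now, hour=17)

# Round-trip back to a datetime to check the canonical value.
resolved = datetime.fromtimestamp(timestamp, tz=timezone.utc)
```

A model that skips the lookup of the current time and invents a plausible-looking timestamp would produce a well-formed but wrong argument, which matches the hallucinated-timestamp failure described above.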

Insufficient Information:

This category tests whether a model can recognize that a task is unsolvable with the available tools and avoid hallucinating. Even strong models such as GPT-3.5-Turbo and GPT-4 struggled here.

Implications and Future Work

Practical Implications

ToolSandbox provides a robust framework for evaluating LLM tool-use capabilities, moving beyond static, single-turn evaluations to dynamic, multi-turn, and interactive scenarios. This makes it a valuable tool for developing more capable and versatile LLM-based agents.

Theoretical Implications

The introduction of stateful and state-dependent tools, realistic user simulation, and flexible evaluation metrics marks a significant step forward in understanding the intricacies of LLM-driven task completion.

Speculations on Future Developments

The limitations acknowledged in the paper open avenues for future work, including automated milestone annotation, enhanced user simulators with expanded tool sets, and orchestration-level solutions for mandatory confirmations. Additionally, incorporating tools that can spawn daemon processes presents an interesting problem for both orchestration and model design.

Conclusion

ToolSandbox offers a framework for evaluating LLM tool use that covers state dependency, multi-turn conversation, and interactive scenarios, all of which matter when LLMs act as autonomous agents. Future work building on this benchmark could address the limitations noted above and further advance tool-assisted LLM research.