Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios

Published 30 Jan 2024 in cs.CL | (2401.17167v3)

Abstract: The recent trend of using LLMs as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.


Authors (13)

Citations (16)


Summary

The paper introduces UltraTool, a novel benchmark that evaluates LLMs’ abilities to plan, create, and use tools effectively in complex, real-world scenarios.
It employs a multi-dimensional framework across 22 domains with 5,824 examples to assess planning methodologies, tool creation, and usage accuracy.
Results indicate that larger models like GPT-4 outperform smaller ones, with language orientation and JSON formatting significantly impacting tool utilization.

Comprehensive Evaluation of LLMs' Tool Utilization in Real-World Scenarios

Introduction

The paper "Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios" (2401.17167) introduces UltraTool, a benchmark for assessing how well LLMs utilize tools in complex real-world scenarios. Unlike existing benchmarks that rely on synthetic queries and pre-defined toolsets, UltraTool emphasizes the complexity inherent in real-life applications and covers the entire tool-utilization process: planning, creating, and using tools. Planning is evaluated independently in natural language, before any tool is applied, which maps out the intermediate steps of a task and removes the restriction of a pre-defined toolset.

UltraTool Framework

UltraTool evaluates LLMs' tool utilization along three dimensions:

Planning: UltraTool requires models to engage in precise, logical, multi-step planning within natural language frameworks before proceeding with tool applications, ensuring that complex problems are decomposed into manageable sub-tasks.
Tool Creation: The benchmark assesses whether LLMs recognize when the existing tools are insufficient and whether they can create new tools when necessary, so that tasks requiring tools beyond a predefined set can still be solved.
Tool Usage: This dimension evaluates three essential aspects: tool use awareness, tool selection appropriateness, and the proper specification of input parameters for tool execution.
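
The three tool usage aspects can be illustrated with a small sketch. Everything here is hypothetical and is not UltraTool's actual data schema or scoring code: the `Tool` and `Step` classes, the `check_step` helper, and the example tool names are assumptions made only to show what awareness, selection, and parameter specification might look like when checked mechanically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Tool:
    """Hypothetical tool description: a name plus required parameter names."""
    name: str
    required_params: Set[str]


@dataclass
class Step:
    """One step of a natural-language plan, optionally bound to a tool call."""
    description: str
    tool_name: Optional[str] = None  # None: the step is judged to need no tool
    arguments: Dict[str, object] = field(default_factory=dict)


def check_step(step: Step, toolset: Dict[str, Tool]) -> List[str]:
    """Return the problems found in one step (an empty list means none found)."""
    if step.tool_name is None:
        return []  # awareness: no tool was requested for this step
    tool = toolset.get(step.tool_name)
    if tool is None:
        # selection: the named tool is not in the toolset, so it would
        # have to be created before the step could run
        return [f"unknown tool: {step.tool_name}"]
    # parameter specification: every required parameter must be supplied
    missing = tool.required_params - set(step.arguments)
    return [f"missing parameter: {name}" for name in sorted(missing)]
```

For example, with a toolset containing only a hypothetical `search_flights` tool that requires `origin` and `destination`, a step that supplies only `origin` would be reported as missing `destination`, while a step naming an absent tool would be flagged as a candidate for tool creation.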

UltraTool spans 22 distinct domains and incorporates a dataset consisting of 5,824 examples, each accompanied by extensive tool instructions, challenging the LLMs to demonstrate adaptability over diverse and complex queries.

Figure 1: Illustration of (a) tool utilization process in real-world complex scenarios and (b) construction process of previous benchmarks.

Results and Analysis

Evaluating a range of LLMs, including open-source models such as LLaMA2 and closed-source ones such as GPT-4, yields the following findings on tool utilization capabilities:

Performance Metrics: GPT-4 excels across both the planning and tool creation dimensions, achieving the highest overall scores. Smaller-scale models, while syntactically sound, often lag in comprehensive query understanding and executional robustness.
Model Scale: Tool utilization ability increases with model scale, reinforcing the idea that larger models generally perform better.
Language Variability: Language orientation significantly affects performance: Chinese-oriented LLMs generally perform better on the Chinese dataset, and English-oriented LLMs on the English dataset.

Figure 2: Multi-dimensional scores of planning and tool creation evaluation of 4 representative models under different model scales, including: Qwen-7B, Qwen-14B, Qwen-72B, and GPT-4.

Error Analysis and Metrics

Error analysis identified five major error types: instruction non-adherence, hallucinated outputs, redundant content generation, incomplete outputs, and incorrect JSON formatting. The models' JSON format correct rates closely track their overall scores, suggesting that format compliance is crucial for effective tool utilization.
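
A JSON format correct rate can be approximated as the fraction of model outputs that parse as valid JSON. The sketch below is an assumption about how such a rate might be computed, not the paper's evaluation code; the function name `json_format_correct_rate` is hypothetical, and the benchmark may apply stricter criteria (for example, requiring particular keys).

```python
import json
from typing import Iterable


def json_format_correct_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs that parse as JSON (0.0 for an empty input)."""
    total = 0
    valid = 0
    for text in outputs:
        total += 1
        try:
            json.loads(text)
        except (json.JSONDecodeError, TypeError):
            # Truncated, malformed, or non-string output counts as incorrect.
            continue
        valid += 1
    return valid / total if total else 0.0
```

Under this definition, the incomplete-output and redundant-content error types also tend to lower the rate, since truncated objects or extra text around a JSON object both fail to parse.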

Figure 3: Correlation between the JSON format correct rate and the overall score in the main results table.
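
One way to quantify a relationship like the one in Figure 3 is the Pearson correlation coefficient between per-model JSON format correct rates and overall scores. The summary does not state which statistic, if any, the authors computed, so the following is only an illustrative sketch; the function name `pearson` is hypothetical and no values from the paper are reproduced here.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 would indicate that models with higher format correct rates also tend to have higher overall scores.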

Implications and Future Directions

By providing a multi-dimensional evaluation framework, UltraTool highlights the need for progress in tool creation and accurate tool execution. The results point to substantial room for improvement, especially in understanding complex queries and in response accuracy.

The benchmark also encourages further research into applying LLMs to tool interaction and into refining models' ability to handle real-world complexity.

Conclusion

UltraTool represents a significant step in evaluating and enhancing LLM tool utilization capabilities by addressing the real-world complexities of planning, creation, and application. It serves as a foundational platform for future explorations and improvements in tool-related LLM functionalities, underlining the necessity for meticulous design and execution abilities in complex scenarios.
