ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use (2501.02506v1)

Published 5 Jan 2025 in cs.CL

Abstract: Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of LLMs. However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/bytedance-research/ToolHop.

Evaluating Multi-Hop Tool Use in LLMs: Insights from ToolHop

The paper "ToolHop: A Query-Driven Benchmark for Evaluating LLMs in Multi-Hop Tool Use" introduces ToolHop, a comprehensive dataset designed to rigorously assess the capabilities of LLMs in scenarios that demand multi-hop reasoning with tool utilization. This research addresses a prominent gap in the evaluation frameworks available for LLM tool-use capabilities, focusing particularly on multi-hop scenarios which are more complex and reflective of real-world tasks.

Main Contributions

  1. Dataset Construction: ToolHop is built with a novel query-driven approach and comprises 995 multi-hop queries paired with 3,912 locally executable tools. The construction starts from multi-hop queries and then designs the tools each query requires, ensuring meaningful interdependencies between tool calls and coverage of diverse, realistic queries (a loading sketch follows this list).
  2. Evaluation Across Models: The authors evaluate 14 LLMs from five model families (LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT) on ToolHop. This evaluation exposes the current limits of tool-use capabilities: even the best-performing model, GPT-4o, reaches only 49.04% accuracy on multi-hop tool-use tasks.
  3. Analytical Insights: Detailed analyses highlight distinct tool-use strategies and failure modes across model families. For instance, the Qwen2.5 family tends to issue parallel tool calls, which leads to hallucinations, whereas GPT-family models adapt better when given detailed feedback.
  4. Implications for Future Developments: The findings underscore substantial room for improvement in the multi-hop reasoning capabilities of LLMs. The paper points to future work on more adaptable tool-use behavior and richer interaction-feedback mechanisms to improve model performance.
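
The dataset and code are released on Hugging Face (linked in the abstract). Below is a minimal sketch for pulling the release and inspecting it; the repository id comes from the paper, but the repo type and file layout are assumptions, so the glob pattern and parsing will likely need adjusting to the actual contents.

```python
# Minimal sketch, assuming the ToolHop release ships its queries/tools as JSON files.
# The repo id is taken from the paper; everything else here is an assumption.
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# Download a local copy of the release (repo_type may need to be "dataset").
local_dir = snapshot_download("bytedance-research/ToolHop")

# Walk any JSON files and report a rough count of top-level entries in each.
for path in sorted(Path(local_dir).rglob("*.json")):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path.relative_to(local_dir)}: {size} top-level entries")
```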

Technical Approach

The methodology behind ToolHop distinguishes itself with a structured pipeline of tool creation, document refinement, and code generation. This design yields queries that genuinely require multi-hop reasoning while keeping the dataset coherent and scalable. By starting from queries rather than tools, ToolHop avoids a weakness of earlier tool-driven datasets, in which queries were simulated around pre-collected tools and therefore often lacked meaningful interdependencies between calls.
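
To make the notion of meaningful interdependencies concrete, the toy example below chains two locally executable tools so that the second call depends on the first call's output. The tool names, query, and lookup values are invented for illustration and are not drawn from the ToolHop dataset.

```python
# Illustrative sketch (not from the paper) of a multi-hop query resolved by chaining
# locally executable tools: the second call depends on the output of the first.

def get_director(film: str) -> str:
    """Toy lookup tool: film title -> director name."""
    return {"Inception": "Christopher Nolan"}.get(film, "unknown")

def get_birth_year(person: str) -> int:
    """Toy lookup tool: person name -> birth year."""
    return {"Christopher Nolan": 1970}.get(person, -1)

TOOLS = {"get_director": get_director, "get_birth_year": get_birth_year}

# Query: "In what year was the director of Inception born?"
# A model must plan the hop: first resolve the director, then feed that result
# into the next tool call. Issuing both calls in parallel would force it to guess
# the intermediate answer, the hallucination failure mode noted for Qwen2.5.
director = TOOLS["get_director"]("Inception")    # hop 1
birth_year = TOOLS["get_birth_year"](director)   # hop 2, depends on hop 1
print(birth_year)  # 1970
```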

Evaluation Metrics

To assess the models thoroughly, the paper evaluates along two primary dimensions: answer correctness and invocation error. Correctness is measured both in direct-answer scenarios and in settings where tool use is mandatory, capturing each model's understanding of the query and its procedural accuracy. Invocation errors are quantified separately to gauge how reliably each model formats and issues tool calls.
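
As a rough illustration of these two dimensions, the sketch below computes answer accuracy and an invocation error rate from hypothetical per-query logs; the record fields (prediction, gold, invocation_errors) are placeholders and do not reflect ToolHop's actual schema or answer checker.

```python
# Rough sketch of the two evaluation dimensions, computed over hypothetical
# per-query logs; field names are placeholders, not ToolHop's schema.

def answer_correct(pred: str, gold: str) -> bool:
    """Loose string match; the paper's checker verifies answers more carefully."""
    return pred.strip().lower() == gold.strip().lower()

def evaluate(records):
    """records: list of dicts with 'prediction', 'gold', and 'invocation_errors' (int)."""
    n = len(records)
    accuracy = sum(answer_correct(r["prediction"], r["gold"]) for r in records) / n
    error_rate = sum(r["invocation_errors"] > 0 for r in records) / n
    return {"answer_accuracy": accuracy, "invocation_error_rate": error_rate}

example = [
    {"prediction": "1970", "gold": "1970", "invocation_errors": 0},
    {"prediction": "1975", "gold": "1970", "invocation_errors": 2},
]
print(evaluate(example))  # {'answer_accuracy': 0.5, 'invocation_error_rate': 0.5}
```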

Implications and Future Work

The ToolHop dataset and its application show that while LLMs have made strides in tool utilization, their ability to handle multi-hop tool use remains limited. The authors call for advances both in dataset construction methodologies and in model architectures that better support adaptive, real-world usage.

Looking forward, the paper identifies potential areas for enhancement:

  • Developing LLMs that comprehend complex, real-world task requirements and optimize parallel processing without sacrificing accuracy.
  • Leveraging rich feedback for dynamic correction of tool invocation procedures.
  • Creating more robust datasets that include a wider range of structured data types and parameter complexities.

In summary, ToolHop sets a new standard in evaluating LLMs' tool-use capabilities and paves the way for the development of models with improved reasoning skills in multi-step tasks. The paper serves as an important contribution to the ongoing efforts to develop truly general-purpose LLMs capable of performing complex, real-world reasoning tasks.

Authors (14)
  1. Junjie Ye (66 papers)
  2. Zhengyin Du (8 papers)
  3. Xuesong Yao (3 papers)
  4. Weijian Lin (5 papers)
  5. Yufei Xu (24 papers)
  6. Zehui Chen (41 papers)
  7. Zaiyuan Wang (2 papers)
  8. Sining Zhu (1 paper)
  9. Zhiheng Xi (37 papers)
  10. Siyu Yuan (46 papers)
  11. Tao Gui (127 papers)
  12. Qi Zhang (784 papers)
  13. Xuanjing Huang (287 papers)
  14. Jiechao Chen (1 paper)