Evaluating Multi-Hop Tool Use in LLMs: Insights from ToolHop
The paper "ToolHop: A Query-Driven Benchmark for Evaluating LLMs in Multi-Hop Tool Use" introduces ToolHop, a comprehensive dataset designed to rigorously assess the capabilities of LLMs in scenarios that demand multi-hop reasoning with tool utilization. This research addresses a prominent gap in the evaluation frameworks available for LLM tool-use capabilities, focusing particularly on multi-hop scenarios which are more complex and reflective of real-world tasks.
Main Contributions
- Dataset Construction: ToolHop is built with a novel query-driven approach and comprises 995 multi-hop queries paired with 3,912 locally executable tools. The methodology starts from the multi-hop queries and only then designs the tools they require, ensuring meaningful interdependencies between tools and coverage of diverse real-world queries (see the sketch after this list).
- Evaluation Across Models: The authors evaluate 14 LLMs from five major model families on ToolHop: LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT. This comprehensive evaluation uncovers critical insights into the current state of tool-use capabilities in LLMs; even a top model like GPT-4o reaches only 49.04% accuracy on multi-hop tool-use tasks.
- Analytical Insights: Detailed analyses highlight specific challenges and differences in tool-use strategy across model families. For instance, the Qwen2.5 family tends toward parallel tool calls, which lead to hallucinations, whereas models in the GPT family adapt better when given detailed feedback.
- Implications for Future Developments: The findings underscore significant room for improvement in the multi-hop reasoning capabilities of LLMs. The paper suggests future research directions focusing on developing more adaptable tool-use models and improving interaction feedback mechanisms to enhance model performance.
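To make the idea of locally executable tools with meaningful interdependencies concrete, the sketch below shows what a hypothetical two-hop instance could look like: each tool is a schema-style document paired with a plain Python function that runs offline, and the query forces the model to chain the calls. The tool names, schemas, and lookup values are illustrative assumptions, not entries from the released dataset.

```python
# Hypothetical sketch of two interdependent, locally executable tools.
# Field names and values are assumptions for illustration only.

author_lookup_doc = {
    "name": "get_book_author",
    "description": "Return the author of a given book title.",
    "parameters": {
        "type": "object",
        "properties": {"title": {"type": "string", "description": "Book title"}},
        "required": ["title"],
    },
}

def get_book_author(title: str) -> str:
    """Locally executable stub backed by a small lookup table."""
    table = {"The Hobbit": "J. R. R. Tolkien"}
    return table.get(title, "unknown")

birth_year_doc = {
    "name": "get_birth_year",
    "description": "Return the birth year of a named person.",
    "parameters": {
        "type": "object",
        "properties": {"person": {"type": "string", "description": "Full name"}},
        "required": ["person"],
    },
}

def get_birth_year(person: str) -> int:
    """Second hop: consumes the output of the first tool."""
    table = {"J. R. R. Tolkien": 1892}
    return table.get(person, -1)

# A two-hop query such as "In what year was the author of The Hobbit born?"
# requires chaining: the first call's output becomes the second call's input.
print(get_birth_year(get_book_author("The Hobbit")))  # -> 1892
```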
Technical Approach
The methodology behind ToolHop follows a structured process of tool creation, document refinement, and code generation. Because each tool is derived from a query rather than the other way around, the resulting queries genuinely require multi-hop reasoning while the dataset remains coherent and scalable. This query-driven approach is meant to remedy a weakness of earlier tool-driven datasets, in which queries were simulated to fit pre-collected tools and therefore often lacked real interdependencies between tool calls.
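A minimal sketch of this query-driven flow is below, under the assumption that an LLM drives each stage. The helper names (decompose_query, draft_tool_document, refine_document, generate_tool_code) and the llm() placeholder are illustrative; they are not the authors' code.

```python
# Hypothetical skeleton of query-driven dataset construction:
# query -> per-hop sub-queries -> tool documents -> refined documents -> code.

def llm(prompt: str) -> str:
    """Placeholder for whichever LLM drives dataset construction."""
    raise NotImplementedError

def decompose_query(query: str) -> list[str]:
    """Split a multi-hop query into atomic sub-queries, one per hop."""
    return llm(f"Decompose into single-hop sub-questions:\n{query}").splitlines()

def draft_tool_document(sub_query: str) -> str:
    """Design a tool (name, description, parameters) that answers one hop."""
    return llm(f"Write a JSON tool schema that could answer: {sub_query}")

def refine_document(doc: str) -> str:
    """Document refinement: enrich the schema for broader, realistic use."""
    return llm(f"Refine this tool schema with extra parameters and edge cases:\n{doc}")

def generate_tool_code(doc: str) -> str:
    """Code generation: produce a locally executable implementation of the schema."""
    return llm(f"Implement this tool as a self-contained Python function:\n{doc}")

def build_instance(query: str) -> list[tuple[str, str]]:
    """One dataset instance: the multi-hop query plus a (document, code) pair per hop."""
    tools = []
    for sub_query in decompose_query(query):
        doc = refine_document(draft_tool_document(sub_query))
        tools.append((doc, generate_tool_code(doc)))
    return tools
```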
Evaluation Metrics
To assess the models thoroughly, the paper evaluates along two primary dimensions: answer correctness and invocation error. Answer correctness is measured both when models answer directly and when tool use is mandatory, reflecting each model's understanding of the query and its procedural accuracy. Invocation errors are quantified separately to gauge how reliably each model formulates and issues tool calls.
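The sketch below illustrates how these two dimensions could be computed, assuming each evaluation record stores the model's final answer, the gold answer, and a log of attempted tool calls. The field names and the answers_match() check are assumptions for illustration, not the paper's exact scoring code.

```python
# Hedged sketch of the two evaluation dimensions: answer correctness and
# invocation error. Record structure and matching logic are illustrative.
from dataclasses import dataclass, field

@dataclass
class EvalRecord:
    predicted_answer: str
    gold_answer: str
    tool_calls: list = field(default_factory=list)  # each call: {"ok": bool, ...}

def answers_match(pred: str, gold: str) -> bool:
    """Loose string match; the paper's judging may be stricter or LLM-assisted."""
    return pred.strip().lower() == gold.strip().lower()

def answer_correctness(records: list[EvalRecord]) -> float:
    """Fraction of queries answered correctly (direct-answer or tool-mandatory runs)."""
    return sum(answers_match(r.predicted_answer, r.gold_answer) for r in records) / len(records)

def invocation_error_rate(records: list[EvalRecord]) -> float:
    """Fraction of queries where at least one tool call failed to execute."""
    failed = sum(any(not call["ok"] for call in r.tool_calls) for r in records)
    return failed / len(records)
```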
Implications and Future Work
The ToolHop dataset and the accompanying evaluation show that, while LLMs have made strides in tool utilization, their mastery of multi-hop tool use remains limited. The authors call for advances both in dataset construction methodology and in model architectures that better support adaptive, real-world usage.
Looking forward, the paper identifies potential areas for enhancement:
- Developing LLMs that comprehend complex, real-world task requirements and optimize parallel processing without sacrificing accuracy.
- Leveraging rich feedback for dynamic correction of tool invocation procedures.
- Creating more robust datasets that include a wider range of structured data types and parameter complexities.
In summary, ToolHop sets a new standard in evaluating LLMs' tool-use capabilities and paves the way for the development of models with improved reasoning skills in multi-step tasks. The paper serves as an important contribution to the ongoing efforts to develop truly general-purpose LLMs capable of performing complex, real-world reasoning tasks.