ToolBench: LLM Tool-Use Benchmark

Updated 29 October 2025
  • ToolBench is a richly annotated benchmark that assesses LLM tool-use by translating complex instructions into real-world API calls.
  • It features over 16,000 APIs and 3,451 tools, with automated instruction generation and DFSDT-based solution path annotation.
  • The benchmark has spurred innovations in progressive retrieval and error-driven learning, significantly enhancing LLM performance.

ToolBench is a large-scale, richly annotated benchmark that has become central to the development and evaluation of tool-augmented LLMs. Designed both to rigorously assess and to enable LLM manipulation of diverse real-world software tools, it provides the infrastructure and data required for training, benchmarking, and improving open-source and closed-source agents in complex, multi-step, multi-tool settings. Its technical detail, scale, and diverse coverage have made it the canonical reference for tool-use competence in the LLM era.

1. Genesis, Scope, and Core Contributions

ToolBench was introduced to address the gap between the tool-using abilities of open-source LLMs and closed LLM APIs, and to enable industrially relevant, secure, and performant tool manipulation by LLMs (Xu et al., 2023, Qin et al., 2023). The dataset targets scenarios where models must translate natural language instructions into sequences of external API/tool calls, frequently involving reasoning, planning, multi-step decomposition, and real execution.

Key properties:

  • Scale: Encompasses 3,451 tools and 16,464 real-world APIs across 49 categories, all sourced from RapidAPI.
  • Coverage: Includes single-step, multi-step, single-tool, and multi-tool instruction scenarios.
  • Automation: Instruction and solution path annotation are performed automatically through LLM prompting (primarily ChatGPT), reducing human labor while maintaining broad coverage of realistic workflows.

Dataset Construction Workflow

| Stage | Methodology | Output |
|---|---|---|
| 1. API Collection | Crawl RapidAPI and filter for API quality (availability, meaningful responses) | 16,464 high-quality APIs, 3,451 tools, 49 categories |
| 2. Instruction Generation | ChatGPT generates diverse, realistic instructions for sampled APIs/tools | 87,413 single-tool and 110,066 multi-tool instructions |
| 3. Solution Path Annotation | Chain-of-thought prompting; DFSDT tree search produces (reasoning, API, arguments, response) steps | 126,486 annotated (instruction, solution path) pairs |
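
The second stage of this pipeline can be illustrated with a short, hedged sketch: sampled API documentation is packed into a prompt, and an LLM is asked to invent realistic instructions that require those APIs. The `chat` callable below is a hypothetical wrapper around any chat-completion endpoint, and the prompt wording is illustrative rather than the exact ToolBench prompt.

```python
# Sketch of Stage 2 (instruction generation); `chat(prompt) -> str` is a
# hypothetical LLM wrapper, and the prompt text is illustrative only.
import json
import random

def build_instruction_prompt(sampled_apis: list[dict], multi_tool: bool) -> str:
    """Ask the LLM to invent realistic user requests that need the sampled APIs."""
    api_docs = "\n".join(
        f"- {api['tool']}.{api['name']}: {api['description']}" for api in sampled_apis
    )
    scenario = "several different tools" if multi_tool else "a single tool"
    return (
        "You are given the following API documentation:\n"
        f"{api_docs}\n\n"
        f"Write 10 diverse, realistic user requests that require {scenario} from the "
        "list above to fulfil. Return them as a JSON list of strings."
    )

def generate_instructions(api_pool: list[dict], chat, multi_tool: bool = False) -> list[str]:
    """Sample APIs (one tool, or several for multi-tool instructions) and prompt the LLM."""
    sampled = random.sample(api_pool, k=3 if multi_tool else 1)
    return json.loads(chat(build_instruction_prompt(sampled, multi_tool)))
```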

Further augmentation and cleaning steps are applied in later works (notably via meta-verification and error-focused pipelines (Ma et al., 5 Jun 2025)).

2. Dataset Structure and Annotation Paradigm

ToolBench adopts a format where each data instance records the following fields, sketched in the example after this list:

  • Instruction: Free-form, often complex NL user request.
  • Relevant APIs/Tools: Documented with names, descriptions, parameter schema, and example responses.
  • Solution Path: Ordered sequence of (<thought>, <api call>, <response>) triples, capturing the reasoning and execution trajectory, including intermediate feedback and possible branch/“give up” steps.
  • Final Answer: The final synthesized result, possibly requiring aggregation of multiple tool outputs.
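
The schematic example below renders one such instance as a Python dictionary. Field names and values are paraphrased from the description above for illustration; they are not the exact keys of the released JSON files.

```python
# Illustrative ToolBench-style data instance (field names are paraphrased,
# not the exact release schema).
example_instance = {
    "instruction": "Find tomorrow's weather in Berlin and book a pet-friendly hotel there.",
    "relevant_apis": [
        {"tool": "OpenWeather", "api": "get_forecast",
         "parameters": {"city": "string", "date": "string"},
         "description": "Returns the forecast for a city on a given date."},
        {"tool": "HotelSearch", "api": "search_hotels",
         "parameters": {"city": "string", "pet_friendly": "bool"},
         "description": "Searches hotels matching the given filters."},
    ],
    "solution_path": [
        {"thought": "First retrieve the forecast, then filter hotels.",
         "api_call": {"api": "get_forecast", "args": {"city": "Berlin", "date": "tomorrow"}},
         "response": {"summary": "Light rain, 14°C"}},
        {"thought": "Weather retrieved; now search for pet-friendly hotels.",
         "api_call": {"api": "search_hotels", "args": {"city": "Berlin", "pet_friendly": True}},
         "response": {"results": ["Hotel Alexanderplatz", "Spree Garden Inn"]}},
    ],
    "final_answer": "Tomorrow in Berlin: light rain around 14°C. Pet-friendly options: "
                    "Hotel Alexanderplatz, Spree Garden Inn.",
}
```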

The annotation process uses LLMs to generate plausible API call trees by registering API endpoints as callable “functions” and employing a depth-first search (DFS) strategy (DFSDT) to discover valid, potentially multi-step solution trajectories. This tree-based approach efficiently spans both easy and hard cases, allowing for backtracking and exploration—a necessity for realistic multi-API settings (Qin et al., 2023).
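
The DFSDT procedure described above can be sketched as a plain depth-first search with backtracking. The sketch below assumes three caller-supplied callables, all hypothetical: `propose_actions` (an LLM proposing candidate thought/API-call pairs given the trajectory so far), `execute` (issuing the API call and returning its response, or `None` on failure), and `is_final` (deciding whether the trajectory already answers the instruction).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One state in the solution tree: the accumulated (thought, call, response) steps."""
    trajectory: list = field(default_factory=list)

def dfsdt_search(instruction, propose_actions, execute, is_final, max_depth=6):
    """Depth-first search over candidate API-call sequences with backtracking."""
    stack = [Node()]
    while stack:
        node = stack.pop()
        if is_final(node.trajectory):
            return node.trajectory                  # success: emit this solution path
        if len(node.trajectory) >= max_depth:
            continue                                # "give up" on this branch and backtrack
        for thought, api_call in propose_actions(instruction, node.trajectory):
            response = execute(api_call)
            if response is None:
                continue                            # failed call: prune this child
            stack.append(Node(node.trajectory + [(thought, api_call, response)]))
    return None                                     # no valid trajectory found
```

Exploring children depth-first lets the annotator back out of dead ends instead of committing to a single chain-of-thought rollout, which is what makes hard multi-API instructions tractable to annotate.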

3. Benchmark Tasks, Metrics, and Evaluation Protocols

ToolBench supports both controlled small-benchmark tasks (for in-depth ablation, see (Xu et al., 2023)) and large-scale multi-domain settings (as in (Qin et al., 2023) and subsequent works).

Classic Task Set (Editor’s term)

  • 8 diverse tasks: OpenWeather API, The Cat API, Home Search, Trip Booking, Google Sheets (via gspread), VirtualHome (simulated household actions), Tabletop (robotics), WebShop (web navigation).
  • Test cases: roughly 100 instructions per task, each mapped to about 7–15 APIs.

Large-Scale, Multi-API Setting

  • Datasets of 126,000+ instruction–solution path pairs; test sets covering unseen instructions, tools, and categories for robust generalization.

Evaluation Metrics

| Metric | Definition / Formula |
|---|---|
| Success Rate / Pass Rate | Proportion of instructions successfully completed: $\frac{\#\,\text{Solved}}{\#\,\text{Total}}$ |
| Win Rate | Pairwise comparison of which of two models' solutions is preferable (per ToolEval or an LLM judge) |
| API Complexity | Log-based distance reflecting the probability of generating test actions from demonstrations |
| Correctness | API name and all required arguments match the ground truth (syntactic and semantic) |
| Hallucination Rate | Fraction of actions calling non-existent or irrelevant APIs |
| Plan.EM / Act.EM | Planning / action exact match (for plan and action sequence generation tasks) |
| Preference Rank | Comparative ranking of solution quality (lower is better) |
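
Two of these metrics are simple enough to compute directly from logged evaluation runs, as in the sketch below. The record layout (keys such as `solved`, `actions`, `api_name`) is assumed for illustration and is not the schema of any ToolBench release.

```python
# Pass rate and hallucination rate over logged runs (record layout is assumed).

def pass_rate(runs: list[dict]) -> float:
    """#Solved / #Total over all evaluated instructions."""
    return sum(r["solved"] for r in runs) / len(runs)

def hallucination_rate(runs: list[dict], known_apis: set[str]) -> float:
    """Fraction of emitted actions that call APIs outside the documented set."""
    actions = [a for r in runs for a in r["actions"]]
    bad = [a for a in actions if a["api_name"] not in known_apis]
    return len(bad) / max(len(actions), 1)

runs = [
    {"solved": True,  "actions": [{"api_name": "get_forecast"}]},
    {"solved": False, "actions": [{"api_name": "get_wether"}]},  # hallucinated API name
]
print(pass_rate(runs))                                # 0.5
print(hallucination_rate(runs, {"get_forecast"}))     # 0.5
```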

Automated evaluation is supported via ToolEval (LLM-based) (Qin et al., 2023), and, in large-scale/stable settings, via StableToolBench (Guo et al., 12 Mar 2024), which uses a virtual API server and GPT-4 as the judge for reproducibility and realism.
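
The pairwise win rate that ToolEval-style judging produces can be sketched as follows. The `judge` callable stands in for the LLM judge (e.g., GPT-4 in StableToolBench); the actual judging prompts and tie handling differ from this minimal version.

```python
import random

def win_rate(solutions_a, solutions_b, judge) -> float:
    """Fraction of instructions on which model A's solution is preferred.

    judge(instruction, first_solution, second_solution) -> "first" or "second";
    presentation order is randomized to reduce position bias in the LLM judge.
    """
    wins = 0
    for (instruction, sol_a), (_, sol_b) in zip(solutions_a, solutions_b):
        if random.random() < 0.5:
            wins += judge(instruction, sol_a, sol_b) == "first"
        else:
            wins += judge(instruction, sol_b, sol_a) == "second"
    return wins / len(solutions_a)
```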

4. Methodological Innovations Leveraging ToolBench

The release of ToolBench catalyzed a surge of research on effective tool planning, retrieval, and execution with LLMs. Key frameworks and findings include:

  • DFSDT reasoning (Qin et al., 2023): Depth-first search in a decision tree space for multi-step tool trajectory annotation and evaluation, enabling backtracking and robust solution discovery.
  • Progressive and preference-driven retrieval (Anantha et al., 2023, Moon et al., 2 Sep 2024): Embedding-based, contrastive, and staged strategies enabling scalable and context-aware tool selection (a minimal retrieval sketch follows this list).
  • Pipeline architectures (e.g., Sum2Act (Liu et al., 28 Feb 2024), ProTIP (Anantha et al., 2023)): Decomposing tool-use into explicit planning, action, state-tracking, and reflection, improving both efficiency and reliability.
  • Plan-based and modular SFT (Qiu et al., 22 Oct 2024): Disentangling tool arrangement (planning) from execution to alleviate arranging bottlenecks.
  • Error-driven and reflection-centric learning (Chen et al., 11 Jun 2024, Ma et al., 5 Jun 2025): Incorporating failed explorations, stepwise preference data, meta-verification, and explicit Error→Reflection→Correction trajectories to improve robustness and generalization.
  • Self-verification (Mekala et al., 21 Feb 2024): Contrastive question-asking to resolve subtle tool/parameter distinctions, boosting generalization to unseen APIs.
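
The retrieval step referenced above reduces, in its simplest form, to ranking API documentation by embedding similarity to the instruction. The sketch below assumes a generic `embed` function returning NumPy vectors; the progressive and preference-trained retrievers cited in this section refine this basic scheme with staged querying and contrastive fine-tuning.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def retrieve_tools(instruction: str, api_docs: list[dict], embed, k: int = 5) -> list[dict]:
    """Return the top-k APIs whose documentation best matches the instruction."""
    query_vec = embed(instruction)
    scored = [
        (cosine(query_vec, embed(f"{doc['name']}: {doc['description']}")), doc)
        for doc in api_docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```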

5. Empirical Findings and Quantitative Outcomes

ToolBench has enabled systematic comparison of state-of-the-art LLM agents and training strategies. Selected highlights:

| Model / Framework | Pass Rate / Correctness | Setting | Relative Gain |
|---|---|---|---|
| Baseline open-source LLMs | 10–37% | Classic tasks | |
| GPT-4 (ICL) | ≈60% | Large-scale | |
| ToolLLaMA / CoT+DFSDT | ≈50% | Multi-API, DFS search | +13% vs. CoT |
| ProTIP | +24% R@10 | Tool retrieval | vs. best TD |
| Sum2Act | 70% | Multi-API, all splits | +3–29% vs. DFSDT |
| TP-LLaMA (stepwise DPO) | 65% | Multi-step, all splits | +12% vs. SFT |
| xLAM (open SOTA) | 0.53–0.59 | Various splits | ≈ GPT-4 parity |
| SWIFT (agent SFT infrastructure) | 60% Act.EM | Agent tool-use, 7B LLM | +5–22% over base |
| DeepAgent (agentic RL) | 69% Pass@1 | ToolBench | +14–25% vs. baselines |

Reliable tool-use at scale now routinely exceeds 70% pass rate or correctness for open 7B–32B parameter LLMs, approaching or matching previously proprietary baselines. Reflection-empowered models such as Tool-MVR (Ma et al., 5 Jun 2025) set new records for both accuracy (up to +24% over ToolLLM) and error recovery rate (ECR 58.9%).

6. Impact and Ongoing Extensions

ToolBench has directly shaped the research landscape of tool-augmented LLMs:

  • Large Action Model scaling and fine-tuning: Enables data generation pipelines, robust reflection, high-quality SFT, and domain transfer, supporting leading initiatives such as xLAM (Zhang et al., 5 Sep 2024) and SWIFT (Zhao et al., 10 Aug 2024).
  • Virtual API server infrastructure: Inspired next-gen evaluation sets with simulated, reproducible APIs and stable automated assessment protocols (StableToolBench (Guo et al., 12 Mar 2024), MirrorAPI (Guo et al., 26 Mar 2025)).
  • Meta-verification and error reflection: Established new data curation and annotation standards, yielding high-precision datasets (ToolBench-V, ToolBench-R (Ma et al., 5 Jun 2025)).
  • Open-source and industry adoption: Lowered the supervision barrier for rapid tool-alignment, allowing secure, high-precision automation within enterprise environments and academic research.
  • Foundation for tool-use agent benchmarks: Adopted as the reference testbed for virtually all major agentic LLM evaluators, tool retriever pipelines, SFT methods, and reasoning agent architectures.

7. Limitations and Future Directions

Despite its scale and impact, ToolBench’s initial releases exhibited limitations:

  • Quality of automation: Up to 50% of queries and 75% of trajectories suffered from incompleteness or hallucinations, issues later addressed by rigorous multi-agent verification (Ma et al., 5 Jun 2025).
  • Evaluation instability: Real-world APIs are prone to deprecation and response drift, motivating virtual simulation and stable server infrastructure (Guo et al., 12 Mar 2024).
  • Reflection and error recovery: Early benchmarks focused on straight-line planning; subsequent iterations (ToolBench-R, RefineToolBench) added error-focused cycles.

Ongoing work expands ToolBench's scope to include refined simulation (e.g., MirrorAPI), more realistic user instruction noise (Wang et al., 31 Aug 2024), and richer, multi-modal scenarios. Future directions involve higher-fidelity human judgment benchmarks, continual tool integration, and deeper error model incorporation, maintaining ToolBench’s centrality in tool-augmented LLM agent research.

