StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models (2403.07714v4)

Published 12 Mar 2024 in cs.CL

Abstract: LLMs have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.

An Overview of StableToolBench: Enhancing Tool Learning Benchmark Stability for LLMs

The paper "StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs" addresses significant challenges related to the benchmarking of LLMs when integrated with external tools. The field of tool learning aims to leverage LLMs to orchestrate various tools for solving complex real-world tasks. However, evaluating this capability reliably and consistently over time has been problematic due to issues with the stability of both the tools and the evaluation metrics.

Motivation and Problem Statement

LLMs such as OpenAI's GPT-4 have yielded impressive results across a range of tasks when augmented with external tools. Existing tool learning benchmarks, though valuable, have been inherently unstable: they rely either on small-scale, hand-crafted tools or on large collections of real-world APIs whose status and accessibility fluctuate. This instability undermines the reproducibility and comparability of reported LLM capabilities, both of which are vital for academic rigor and practical deployment. The paper introduces StableToolBench to address these issues by establishing a synthetic yet robust environment for evaluating tool-use performance.

Core Contributions

The principal innovation of StableToolBench is its dual-component architecture comprising a virtual API server and a stable evaluation framework.

  1. Virtual API Server:
    • Caching System: Stores the outputs of API calls so that identical requests return identical responses across runs, insulating evaluation from APIs whose behaviour or availability changes over time.
    • API Simulator: An LLM-based component (e.g., GPT-4) that reads an API's documentation and cached call examples to generate plausible responses, preserving coverage of tool interactions even when the real API is unreachable. A sketch of how the cache and simulator compose is given directly after this list.
  2. Stable Evaluation System:
    • Introduces two metrics, the Solvable Pass Rate (SoPR) and the Solvable Win Rate (SoWR), computed only over tasks judged solvable and scored with GPT-4 as the automatic evaluator. This removes the randomness and inconsistency of earlier benchmarks, which stemmed from weaker evaluators and from task solvability drifting over time; a second sketch below illustrates how such rates could be aggregated.
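
To make the call flow concrete, here is a minimal Python sketch of a cache-first virtual API server with an LLM-simulator fallback. The class name, the callable interfaces, and the SHA-256 cache key are illustrative assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch of the virtual API server's call flow described above.
# Names (real_api_client, llm_simulator, the cache-key scheme) are
# illustrative assumptions, not StableToolBench's actual interfaces.
import hashlib
import json


class VirtualAPIServer:
    def __init__(self, real_api_client, llm_simulator):
        self.cache = {}                     # persisted API responses
        self.real_api = real_api_client     # callable wrapping live API requests
        self.simulator = llm_simulator      # LLM that mimics an API from its docs

    def _key(self, tool, api, args):
        payload = json.dumps({"tool": tool, "api": api, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool, api, args):
        key = self._key(tool, api, args)

        # 1) Serve cached responses first so repeated runs are reproducible.
        if key in self.cache:
            return self.cache[key]

        # 2) Try the real API; it may be offline or have changed behaviour.
        try:
            response = self.real_api(tool, api, args)
        except Exception:
            response = None

        # 3) Fall back to the LLM simulator, which produces a plausible
        #    response from the API's documentation and cached examples.
        if response is None:
            response = self.simulator(tool, api, args)

        self.cache[key] = response
        return response
```

The ordering matters: the cache guarantees reproducibility for previously seen calls, the real API keeps responses grounded while it remains healthy, and the simulator only steps in when the live call fails.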

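On the evaluation side, SoPR and SoWR are aggregates over the subset of tasks the evaluator deems solvable. The sketch below shows one plausible aggregation; the judge interfaces, the verdict labels, and the 0.5 credit for "unsure" verdicts are assumptions for illustration rather than the paper's exact scoring rules.

```python
# Hedged sketch of aggregating solvable pass rate (SoPR) and solvable win
# rate (SoWR). Judge/preference callables stand in for GPT-4 evaluator calls.
from typing import Callable, Sequence


def solvable_pass_rate(tasks: Sequence[dict],
                       judge: Callable[[dict, str], str]) -> float:
    """Average pass score over tasks pre-filtered as solvable.

    judge(task, answer) is assumed to return "solved", "unsolved", or "unsure"
    (e.g. via a GPT-4 evaluator prompt).
    """
    credit = {"solved": 1.0, "unsure": 0.5, "unsolved": 0.0}
    scores = [credit[judge(task, task["model_answer"])] for task in tasks]
    return sum(scores) / len(scores)


def solvable_win_rate(tasks: Sequence[dict],
                      prefer: Callable[[dict, str, str], str]) -> float:
    """Fraction of solvable tasks where the candidate answer is preferred.

    prefer(task, candidate, reference) is assumed to return "candidate" or
    "reference" according to the automatic evaluator.
    """
    wins = sum(
        prefer(task, task["model_answer"], task["reference_answer"]) == "candidate"
        for task in tasks
    )
    return wins / len(tasks)
```
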
Empirical Evaluation

The authors conducted extensive testing to validate the stability and efficacy of StableToolBench. Results indicate that the virtual API server significantly reduces the degradation in performance related to API instability. The API simulator passed a "Turing test," demonstrating that humans could not consistently distinguish between its outputs and those from real APIs. Moreover, the stable evaluation system outperformed previous systems by aligning more closely with human judgment, thus providing more reliable assessments of LLM tool integration capabilities.

Implications and Future Directions

This work has substantial implications for both theoretical research and practical applications of LLMs. From a theoretical perspective, StableToolBench provides a robust framework for future studies on LLMs with tool learning, facilitating more reliable performance assessments. Practically, these benchmarks could guide improvements in the development of LLM-driven applications in business, healthcare, and other domains where consistent tool interaction is crucial.

Future research may focus on improving the fidelity of the API simulator using more advanced LLMs, potentially enhancing performance with open-source models as they become more capable. Additionally, expanding the scope to include more diverse tool interactions could make the benchmark more comprehensive, thus covering a broader array of real-world scenarios.

Authors (9)
  1. Zhicheng Guo (18 papers)
  2. Sijie Cheng (23 papers)
  3. Hao Wang (1119 papers)
  4. Shihao Liang (11 papers)
  5. Yujia Qin (41 papers)
  6. Peng Li (390 papers)
  7. Zhiyuan Liu (433 papers)
  8. Maosong Sun (337 papers)
  9. Yang Liu (2253 papers)
Citations (17)