
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (2307.16789)
Published 31 Jul 2023 in cs.AI, cs.CL, and cs.LG

Overview

  • A framework named ToolLLM was developed to improve open-source LLMs' ability to work with a wide range of real-world APIs, narrowing the gap with closed-source models such as ChatGPT.

  • ToolBench, an instruction-tuning dataset covering more than 16,000 REST APIs, was constructed by collecting APIs, generating diverse instructions, and annotating solution paths, so that trained models can generalize to APIs unseen during training.

  • An evaluation system called ToolEval measures an LLM's proficiency in executing instructions, with the fine-tuned model ToolLLaMA showing strong performance and generalization.

  • ToolLLaMA is shown to adapt effectively to unseen instructions and to generalize to out-of-distribution datasets; when paired with the API retriever, it can even outperform the setting in which the ground-truth APIs are supplied.

  • The study highlights the potential of open-source LLMs for tool use and the role of resources such as ToolBench and the DFSDT algorithm in shaping future instruction tuning of LLMs for tool use.

Introduction

The integration of LLMs with APIs to accomplish complex tasks has been a focal area of interest in AI research. Open-source models such as LLaMA have shown versatility through various instruction tuning approaches. However, their capabilities in tool-use domains, specifically interacting with external tools or APIs to follow complex human instructions, are not yet on par with state-of-the-art (SOTA) closed-source models such as ChatGPT. To address this gap, the authors present ToolLLM, a framework aimed at enabling open-source LLMs to competently master a wide array of real-world APIs.

Dataset Construction

The construction of the ToolBench dataset is a central aspect of this framework. ToolBench is designed to help LLMs learn to execute APIs and generalize to new ones not encountered during training. The dataset spans 16,464 REST APIs across 49 categories and is constructed in three stages: collecting APIs, generating diverse instructions, and annotating solution paths. It is unique in covering both single-tool and multi-tool scenarios and is built automatically with ChatGPT, minimizing the need for human supervision. A depth-first search-based decision tree (DFSDT) algorithm strengthens the LLMs' reasoning by letting them manage multiple reasoning traces, improving upon chain-based methods such as ReACT.
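
To make the DFSDT idea concrete, the sketch below shows one way such a depth-first search over API calls could be organized. It is a minimal illustration, not the authors' implementation: the `propose_actions` and `execute_api` callables are hypothetical stand-ins for an LLM that proposes candidate API calls and for issuing a real REST request.

```python
from typing import Callable, Optional

def dfsdt_search(
    instruction: str,
    propose_actions: Callable[[str, list, int], list],  # hypothetical LLM call: candidate next API calls
    execute_api: Callable[[dict], dict],                 # hypothetical executor: runs one API call
    max_depth: int = 10,
    branch_width: int = 2,
) -> Optional[list]:
    """Depth-first search over candidate API-call sequences.

    Unlike a single ReACT-style chain, a failed call does not end the episode:
    the erroneous branch is dropped and the search backtracks to an earlier
    state to try an alternative action.
    """
    stack = [([], 0)]  # each entry: (history of (action, observation) pairs, depth)
    while stack:
        history, depth = stack.pop()
        if depth >= max_depth:
            continue
        # Ask the model for several candidate next API calls given the history so far;
        # pushing them in reverse keeps the model's top-ranked candidate on top of the stack.
        for action in reversed(propose_actions(instruction, history, branch_width)):
            observation = execute_api(action)
            new_history = history + [(action, observation)]
            if observation.get("final_answer") is not None:
                return new_history                      # a complete solution path was found
            if "error" not in observation:
                stack.append((new_history, depth + 1))  # keep expanding non-erroneous branches
            # erroneous branches are simply not pushed, which is the backtracking step
    return None  # no valid path within the depth budget
```

The key difference from a single ReACT-style chain is that an erroneous call prunes only its own branch; earlier states remain on the stack, so the model can backtrack and try alternative APIs instead of terminating.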

Evaluation and Model Training

ToolEval, the automated evaluator developed alongside ToolBench, provides metrics that quantify an LLM's ability to carry out instructions effectively. The fine-tuned LLaMA model, referred to as ToolLLaMA, is paired with a neural API retriever and executes complex instructions with performance comparable to ChatGPT, generalizing well even to out-of-distribution tool-use datasets. The neural API retriever removes the need to manually select APIs from the large collection, recommending relevant APIs with high precision.
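
As a rough illustration of how such a dense retriever operates, the sketch below embeds an instruction and a set of API descriptions with a Sentence-BERT-style encoder and ranks APIs by cosine similarity. The encoder name and the toy API documents are assumptions made for the example; they are not the released retriever or the actual ToolBench API corpus.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder, not the paper's trained retriever

# Toy API documents standing in for ToolBench's API descriptions.
api_docs = [
    "weather.get_forecast: returns a 7-day forecast for a given city",
    "flights.search: searches one-way and round-trip flights between two airports",
    "currency.convert: converts an amount between two currency codes",
]
api_embeddings = encoder.encode(api_docs, convert_to_tensor=True, normalize_embeddings=True)

def retrieve_apis(instruction: str, top_k: int = 2):
    """Return the top-k API documents most relevant to the instruction."""
    query = encoder.encode(instruction, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query, api_embeddings)[0]  # cosine similarity to every API document
    best = scores.topk(k=min(top_k, len(api_docs)))
    return [(api_docs[i], float(scores[i])) for i in best.indices.tolist()]

print(retrieve_apis("What will the weather be like in Berlin next week?"))
```

In this setup the retrieved API descriptions would then be placed in the model's context, so the LLM only reasons over a handful of relevant APIs rather than the full collection.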

Insights and Generalization

ToolLLaMA offers compelling evidence of the adaptability of open-source LLMs to unseen instructions and tools, with results that rival those of its teacher model, ChatGPT. Its generalization extends to an out-of-distribution (OOD) dataset, APIBench, where ToolLLaMA performs notably well despite never being trained on APIBench's domains. Notably, ToolLLaMA combined with the API retriever surpasses its own performance with ground-truth APIs, arguably because the retriever can identify more appropriate APIs for a given instruction from the extensive database.

In conclusion, ToolLLM stands as a comprehensive framework that imparts high-level tool-use competence to open-source LLMs, promoting the democratization of AI technologies and community-driven innovation. The methodologies developed within this framework, including ToolBench, DFSDT, ToolEval, and integrated API retrieval, point to the future trajectory of instruction tuning and tool use in LLMs.

Authors (19)
  1. Yujia Qin (37 papers)
  2. Shihao Liang (9 papers)
  3. Yining Ye (9 papers)
  4. Kunlun Zhu (8 papers)
  5. Lan Yan (5 papers)
  6. Yaxi Lu (7 papers)
  7. Yankai Lin (108 papers)
  8. Xin Cong (31 papers)
  9. Xiangru Tang (46 papers)
  10. Bill Qian (3 papers)
  11. Sihan Zhao (10 papers)
  12. Lauren Hong (5 papers)
  13. Runchu Tian (5 papers)
  14. Ruobing Xie (84 papers)
  15. Jie Zhou (524 papers)
  16. Mark Gerstein (20 papers)
  17. Dahai Li (7 papers)
  18. Zhiyuan Liu (353 papers)
  19. Maosong Sun (286 papers)