
TOOLVERIFIER: Generalization to New Tools via Self-Verification (2402.14158v2)

Published 21 Feb 2024 in cs.CL

Abstract: Teaching LLMs to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, LLMs still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection; and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this goal using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.

Enhancing Generalization in Tool Use for LLMs through Self-Verification

Introduction to Tool Use in LLMs

The integration of external tools into LLMs significantly expands their potential for real-world applications, making them more versatile and powerful. Despite this progress, the rapid evolution of tools, marked by frequent updates and the introduction of new functionalities, poses a considerable challenge. Few-shot demonstrations have been the go-to method for incorporating new tools into LLMs, but they often fall short across the vast and varied spectrum of tools, especially when parsing fine distinctions between similar ones.

Introducing ToolVerifier

The paper introduces ToolVerifier, a self-verification method designed to refine both tool selection and parameter generation. ToolVerifier employs contrastive questioning in two distinct phases:

  1. Selecting the most appropriate tool from a set of candidates based on the user instructions.
  2. Generating and validating the parameters necessary for the tool's operation.
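The two phases above can be sketched as a minimal toy, where a shared-word relevance score stands in for the LLM's contrastive self-questioning; the function names and prompts here are illustrative, not the paper's actual templates or API.

```python
def score(instruction: str, description: str) -> int:
    """Toy relevance score: shared-word count (stands in for an LLM's judgment)."""
    return len(set(instruction.lower().split()) & set(description.lower().split()))

def contrastive_question(instruction: str, a: dict, b: dict) -> dict:
    """Self-verification step: pose a contrastive question about two close
    candidates. Here the 'answer' is whichever scores higher; a real system
    would ask the LLM itself to discriminate."""
    return a if score(instruction, a["desc"]) >= score(instruction, b["desc"]) else b

def select_tool(instruction: str, tools: list[dict]) -> dict:
    """Rank candidate tools, then self-verify only when the top two are close."""
    ranked = sorted(tools, key=lambda t: score(instruction, t["desc"]), reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if score(instruction, top["desc"]) - score(instruction, runner_up["desc"]) <= 1:
        return contrastive_question(instruction, top, runner_up)
    return top

tools = [
    {"name": "weather", "desc": "get the current weather for a city"},
    {"name": "calendar", "desc": "add an event to the user calendar"},
]
print(select_tool("what is the weather in Paris", tools)["name"])  # prints: weather
```

The same pattern applies to the second phase: after drafting parameter values, the model is asked verification questions about them before the tool call is committed.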

This approach leverages a synthetic dataset generated with Llama-2 70B, focused on reasoning about the utility and application of a diverse array of tools. Across extensive experiments on a range of tasks, ToolVerifier achieves an average 22% improvement over few-shot baselines, with particular gains in cases where candidate tools differ only in fine nuances.

Dataset and Methodology

The methodology builds on a novel dataset of synthetic tools and corresponding user instructions, designed to train LLMs to select the correct tool and generate the necessary parameters. The dataset covers a wide variety of tools and intentionally includes closely related ones to sharpen the model's ability to discriminate between them. Each training sample includes reasoning notes that explain why a particular tool was selected, embedding a deeper understanding of each tool's context and utility in the model.
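As a concrete illustration, one training sample implied by this description might look like the following; the field names and example tools are assumptions for illustration, not the released dataset's actual schema.

```python
# Hypothetical shape of one synthetic training sample: a small tool library
# (deliberately including two closely related tools), a user instruction,
# a reasoning note, and the selected tool.
sample = {
    "candidate_tools": [
        {"name": "FlightSearch", "description": "find flights between two airports"},
        {"name": "FlightStatus", "description": "check the status of a specific flight"},
    ],
    "instruction": "Is flight BA117 delayed today?",
    "reasoning_note": (
        "The user asks about an existing flight's status, not about finding "
        "flights, so FlightStatus applies."
    ),
    "selected_tool": "FlightStatus",
}
print(sample["selected_tool"])  # prints: FlightStatus
```

The reasoning note is the key ingredient: it turns a bare (instruction, tool) pair into a supervised example of why near-miss candidates should be rejected.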

Experimental Results

The empirical evaluation of ToolVerifier, conducted on the ToolBench benchmark covering 17 unseen tools across four tasks, demonstrates a significant improvement over few-shot and other prompting baselines. The self-verification mechanism alone contributes an 8% increase in accuracy. The experiments underline the model's ability to generalize to a broad spectrum of tools, underscoring the value of the synthetic dataset and the verification methodology for nuanced tool discrimination and parameter generation.

Theoretical and Practical Implications

From a theoretical perspective, the approach extends our understanding of self-verification methodologies in the field of tool use, showcasing their potential in significantly boosting the performance of LLMs. Practically, the development of a robust mechanism for integrating a diverse range of tools with minimal human intervention opens new avenues for building more capable and versatile AI systems. This work lays a foundation for future exploration into self-verification techniques, potentially paving the way for models that can seamlessly adapt to the rapidly evolving landscape of digital tools.

Future Directions

The paper hints at numerous prospects for advancement. Extending this methodology to include compositions of tools and multi-step operations could significantly broaden the scope of tasks LLMs can perform. Further refining the self-verification process to minimize errors and experimenting with even more sophisticated models could unlock new levels of efficacy and efficiency in tool use and generalization capabilities.

Conclusion

ToolVerifier represents a significant step forward in the quest to enhance the tool use capabilities of LLMs. By navigating the challenges associated with the integration and generalization of new tools, this work not only advances our understanding of the complexities involved but also opens up promising possibilities for future research and practical applications in this field of study.

Authors (7)
  1. Dheeraj Mekala
  2. Jason Weston
  3. Jack Lanchantin
  4. Roberta Raileanu
  5. Maria Lomeli
  6. Jingbo Shang
  7. Jane Dwivedi-Yu