
Small LLMs Are Weak Tool Learners: A Multi-LLM Agent (2401.07324v3)

Published 14 Jan 2024 in cs.AI and cs.CL

Abstract: LLM agents significantly extend the capabilities of standalone LLMs, empowering them to interact with external tools (e.g., APIs, functions) and complete various tasks in a self-directed fashion. The challenge of tool use demands that LLMs not only understand user queries and generate answers accurately but also excel in task planning, tool invocation, and result summarization. While traditional works focus on training a single LLM with all these capabilities, performance limitations become apparent, particularly with smaller models. To overcome these challenges, we propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer. Each component is implemented by a single LLM that focuses on a specific capability and collaborates with others to accomplish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability. To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting its efficacy and advantages in tool learning.

An Overview of "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent"

The paper "Small LLMs Are Weak Tool Learners: A Multi-LLM Agent" by Weizhou Shen et al. addresses a significant challenge in the domain of LLMs—their ability to effectively integrate and use external tools. The research highlights the limitations faced by smaller LLMs in performing task planning, tool invocation, and result summarization concurrently. As a novel solution, the authors propose decomposing these capabilities into three distinct roles: planner, caller, and summarizer, each implemented using individual LLMs.

Problem Statement

Traditional approaches often rely on training a single LLM to handle all aspects of task execution, including understanding user queries, deciding on external tool usage, and generating appropriate responses. However, smaller LLMs exhibit clear performance limitations when asked to cover all of these roles at once. In particular, they often falter in maintaining robust and reliable interactions with external tools, which limits their utility in real-world applications where dependable tool use is critical.

Proposed Framework

In response to these challenges, the paper introduces a modular multi-LLM framework, termed α-UMi, which divides the tool-learning process into three specialized components:

  1. Planner: Responsible for task planning and decision-making, deciding the sequence of actions to take for task completion.
  2. Caller: Engages with external tools by crafting accurate and efficient API requests based on the planner's decisions.
  3. Summarizer: Generates the final response for user queries by synthesizing results from the previous steps.

This decomposition lets each LLM focus on a single sub-task, potentially allowing smaller models to be used effectively within the framework.
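
The paper's exact prompts and control flow are not reproduced here, but a minimal Python sketch of the resulting agent loop might look as follows. The `planner_llm`, `caller_llm`, `summarizer_llm`, and `call_tool` callables, the history format, and the "Next: ..." routing strings are illustrative assumptions rather than the framework's actual interface.

```python
# Minimal sketch of the planner -> caller -> summarizer loop described above.
# The three `*_llm` callables and the routing strings are placeholders
# (assumptions), not the exact interfaces used in the paper.

from typing import Callable, Dict, List


def run_agent(
    query: str,
    planner_llm: Callable[[str], str],
    caller_llm: Callable[[str], str],
    summarizer_llm: Callable[[str], str],
    call_tool: Callable[[str], str],
    max_steps: int = 8,
) -> str:
    """Route one user query through the three specialized models."""
    history: List[Dict[str, str]] = []

    for _ in range(max_steps):
        # 1. Planner reads the query plus the trajectory so far and decides
        #    whether to invoke a tool, hand off to the summarizer, or give up.
        context = query + "".join(
            f"\n[{h['role']}] {h['text']}" for h in history
        )
        rationale = planner_llm(context)
        history.append({"role": "planner", "text": rationale})

        if "Next: Summarizer" in rationale:
            # 3. Summarizer synthesizes the final answer from the trajectory.
            return summarizer_llm(context + f"\n[planner] {rationale}")
        if "Next: Give up" in rationale:
            return "The agent could not complete the task."

        # 2. Caller turns the planner's rationale into a concrete API request,
        #    and the tool's observation is appended to the shared history.
        api_request = caller_llm(context + f"\n[planner] {rationale}")
        observation = call_tool(api_request)
        history.append({"role": "caller", "text": api_request})
        history.append({"role": "tool", "text": observation})

    return "The agent ran out of steps."
```

Because the three roles only communicate through the shared trajectory text, each one can be swapped or retrained independently, which is the main practical benefit the paper claims for the modular design.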

Training Methodology

To train the proposed multi-LLM system, the authors introduce a two-stage training paradigm named Global-to-Local Progressive Fine-Tuning (GLPFT). Initially, a backbone LLM is trained on the entire task without discrimination among sub-tasks, fostering a broad understanding of the process. Subsequently, three derivatives of this backbone are separately fine-tuned for their designated roles using task-specific datasets.
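
As a rough illustration of the second GLPFT stage, the sketch below derives planner, caller, and summarizer training targets from a single annotated tool-use trajectory. The field names, the "Next: ..." routing tokens, and the target formats are assumptions made for illustration, not the paper's exact data schema.

```python
# Illustrative sketch: splitting one annotated trajectory into the three
# role-specific training sets used in the second GLPFT stage.

from typing import Dict, List


def split_trajectory(trajectory: Dict) -> Dict[str, List[Dict[str, str]]]:
    """Split one tool-use trajectory into planner/caller/summarizer examples."""
    query = trajectory["query"]
    examples = {"planner": [], "caller": [], "summarizer": []}

    context = query
    for step in trajectory["steps"]:
        # Planner learns to emit the rationale plus a routing decision.
        examples["planner"].append(
            {"input": context, "target": step["rationale"] + " Next: Caller"}
        )
        # Caller learns to emit the concrete API call given that rationale.
        examples["caller"].append(
            {"input": context + "\n" + step["rationale"], "target": step["action"]}
        )
        # The tool's observation extends the shared context for the next step.
        context += (
            "\n" + step["rationale"] + "\n" + step["action"] + "\n" + step["observation"]
        )

    # The planner's final decision routes to the summarizer,
    # which produces the user-facing answer.
    examples["planner"].append({"input": context, "target": "Next: Summarizer"})
    examples["summarizer"].append(
        {"input": context, "target": trajectory["final_answer"]}
    )
    return examples
```

Each of the three resulting datasets is then used to continue fine-tuning its own copy of the stage-one backbone, so all roles start from the same broad understanding of the task before specializing.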

Empirical Evaluation

The framework is evaluated on prominent tool-learning benchmarks such as ToolBench and ToolAlpaca. Results reveal that the proposed multi-LLM agent consistently surpasses the performance of single-LLM configurations, with marked improvements across several metrics including Action Exact Match, Argument F1, and planning accuracy. Notably, the modular structure demonstrates significant advantages in reducing hallucinations and improving both in-domain and out-of-domain task performance.
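
For concreteness, here is a plausible implementation of two of the instance-level metrics named above, assuming an action is a tool name and its arguments form a flat key-value dictionary; the benchmarks' official evaluation scripts may differ in detail.

```python
# Rough sketch of Action Exact Match and Argument F1 under the assumption
# that an action is a tool name and its arguments are a flat dict of strings.

from typing import Dict


def action_exact_match(pred_action: str, gold_action: str) -> float:
    """1.0 if the predicted tool name matches the reference exactly."""
    return float(pred_action.strip() == gold_action.strip())


def argument_f1(pred_args: Dict[str, str], gold_args: Dict[str, str]) -> float:
    """F1 over (key, value) argument pairs shared by prediction and reference."""
    pred_pairs = set(pred_args.items())
    gold_pairs = set(gold_args.items())
    if not pred_pairs and not gold_pairs:
        return 1.0
    if not pred_pairs or not gold_pairs:
        return 0.0
    overlap = len(pred_pairs & gold_pairs)
    precision = overlap / len(pred_pairs)
    recall = overlap / len(gold_pairs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```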

Implications and Future Directions

The modular approach described in the paper shows that smaller LLMs can be leveraged effectively once a complex tool-use task is broken down into manageable components. The findings could inform AI systems in which integrating real-time, evolving tool ecosystems is essential.

Future work could explore optimizing the interplay between the planner, caller, and summarizer, possibly incorporating dynamic adaptability to enhance task execution in changing environments. Additionally, further research might investigate the integration of this framework with other neural architectures or varying sizes of LLMs to further scale performance while minimizing computational overhead.

In conclusion, the paper makes significant strides in addressing the identified deficits of small LLMs in tool-learning tasks through an innovative, decomposed framework, paving the way for future explorations and applications in AI-driven task automation and human-computer interaction.

Authors (8)
  1. Weizhou Shen (18 papers)
  2. Chenliang Li (92 papers)
  3. Hongzhan Chen (6 papers)
  4. Ming Yan (190 papers)
  5. Xiaojun Quan (52 papers)
  6. Hehong Chen (10 papers)
  7. Ji Zhang (176 papers)
  8. Fei Huang (408 papers)
Citations (36)