
ControlLLM: Augment Language Models with Tools by Searching on Graphs (2310.17796v3)

Published 26 Oct 2023 in cs.CV and cs.MM

Abstract: We present ControlLLM, a novel framework that enables LLMs to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. To overcome these challenges, our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is at https://github.com/OpenGVLab/ControlLLM.


Summary

  • The paper introduces a novel framework that decomposes tasks, uses a Thoughts-on-Graph paradigm, and efficiently schedules tool execution.
  • It demonstrates a 93% success rate on multi-modal tasks by precisely selecting and managing tools via advanced search strategies.
  • The modular design and adaptive execution engine pave the way for scalable LLM integrations in real-world, complex applications.

The paper presents ControlLLM, a framework designed to enhance the capabilities of LLMs by integrating multi-modal tools, allowing them to solve complex real-world tasks efficiently. The framework addresses three significant challenges in tool-augmented LLMs: ambiguous user prompts, inaccurate tool selection and parameterization, and inefficient tool scheduling. It does so through three core components: task decomposition, a Thoughts-on-Graph (ToG) paradigm, and an execution engine.

Key Components of ControlLLM

  1. Task Decomposition: This module breaks a complex task into well-defined subtasks with explicit inputs and outputs. It is pivotal in disambiguating user prompts, simplifying the planning and execution that follow. An LLM analyzes the user request and emits the identified subtasks in JSON format, which the later stages consume (a hypothetical example follows this list).
  2. Thoughts-on-Graph (ToG) Paradigm: The framework's primary innovation, ToG searches a pre-constructed tool graph that encodes the parameter and dependency relationships among tools. Because the search operates on a graph, it accommodates complex solution topologies, enabling optimal tool selection and efficient task planning. Four search strategies (greedy, beam, adaptive, and exhaustive) navigate the graph, each with distinct trade-offs between time complexity and solution accuracy (a beam-search sketch follows this list).
  3. Execution Engine: Backed by a rich toolbox and access to various computational resources, this component interprets the solution paths produced by ToG, parallelizes tool executions where dependencies allow, and revises arguments autonomously, improving the efficiency of the overall process (a minimal executor sketch follows this list).
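To make item 1 concrete, here is a hypothetical example of what a JSON-formatted decomposition might look like. The field names, placeholders, and values are illustrative assumptions for this summary, not the paper's exact schema:

```json
{
  "user_request": "Generate a video from this image and add narration",
  "subtasks": [
    {
      "id": 0,
      "description": "Describe the content of the given image",
      "inputs": [{"type": "image", "value": "example.png"}],
      "outputs": [{"type": "text", "value": "<caption-0>"}]
    },
    {
      "id": 1,
      "description": "Generate a narrated video from the produced caption",
      "inputs": [{"type": "text", "value": "<caption-0>"}],
      "outputs": [{"type": "video", "value": "<video-0>"}]
    }
  ]
}
```

Each subtask names the resources it consumes and produces, which is what lets the next stage search for tools whose input and output types match.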
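The beam strategy from item 2 can be sketched as follows. This is a minimal illustration of searching a tool graph whose edges are implied by matching resource types; the toy toolbox, scoring function, and API are invented for the example and are not the paper's implementation:

```python
# Minimal beam search over a tool graph, in the spirit of Thoughts-on-Graph.
# Tools, graph structure, and scoring are illustrative assumptions.
from dataclasses import dataclass
import heapq

@dataclass
class Tool:
    name: str
    inputs: set    # resource types consumed, e.g. {"image"}
    outputs: set   # resource types produced, e.g. {"text"}

# Toy toolbox; an edge exists from tool A to tool B whenever an output
# type of A satisfies an input type of B.
TOOLS = [
    Tool("image_captioning", {"image"}, {"text"}),
    Tool("text_to_speech",   {"text"},  {"audio"}),
    Tool("text_to_video",    {"text"},  {"video"}),
    Tool("dub_video",        {"video", "audio"}, {"video"}),
]

def beam_search(available, goal, beam_width=2, max_depth=4):
    """Find a tool sequence turning `available` resource types into `goal`."""
    # Each beam entry: (cost so far, resource types held, path of tool names).
    beam = [(0.0, frozenset(available), [])]
    for _ in range(max_depth):
        candidates = []
        for cost, resources, path in beam:
            if goal <= resources:
                return path  # every goal resource type has been produced
            for tool in TOOLS:
                if tool.inputs <= resources and tool.name not in path:
                    new_res = resources | frozenset(tool.outputs)
                    # Hypothetical scoring: prefer shorter paths; a real
                    # system would also score tool relevance to the subtask.
                    candidates.append((cost + 1.0, new_res, path + [tool.name]))
        # Prune to the best `beam_width` partial solutions.
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
        if not beam:
            break
    return None

# E.g. ['image_captioning', 'text_to_speech', 'text_to_video']
print(beam_search({"image"}, {"video", "audio"}))
```

Setting beam_width=1 recovers a greedy search, and skipping the pruning step recovers an exhaustive one; the paper's adaptive strategy presumably trades between these extremes.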
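Item 3's dependency-aware parallelism can be sketched in the same spirit. The step format and helper below are assumptions for illustration, not the actual engine:

```python
# Minimal sketch of an executor that runs a solution path in "waves",
# launching in parallel every step whose inputs are already available.
from concurrent.futures import ThreadPoolExecutor

def run_solution_path(steps, resources):
    """steps: list of dicts with 'tool' (a callable), 'args' (resource keys
    it consumes), and 'output' (the resource key it produces; assumed unique).
    resources: dict of already-available resources, mutated in place."""
    pending = list(steps)
    with ThreadPoolExecutor() as pool:
        while pending:
            # A step is ready once all the resources it consumes exist.
            ready = [s for s in pending if all(a in resources for a in s["args"])]
            if not ready:
                raise RuntimeError("unsatisfiable dependencies in solution path")
            futures = {
                s["output"]: pool.submit(s["tool"], *(resources[a] for a in s["args"]))
                for s in ready
            }
            for out_key, fut in futures.items():
                resources[out_key] = fut.result()  # collect this wave's outputs
            pending = [s for s in pending if s not in ready]
    return resources
```

Two steps with no mutual dependency, such as captioning an image and synthesizing audio from an earlier caption, land in the same wave and run concurrently, which is the scheduling efficiency the paper targets.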

Evaluation and Performance

ControlLLM was evaluated on a purpose-built benchmark spanning image, audio, and video processing tasks. Tasks were categorized by complexity, ranging from simple interactions involving few APIs to scenarios requiring many APIs composed together. The framework demonstrated superior accuracy, achieving a 93% success rate in overall solution evaluation on challenging tasks, compared to 59% for the best existing method. Performance was measured with metrics covering tool selection, resource hallucination, type consistency, and overall solution efficacy.

Implications and Future Directions

ControlLLM offers a significant step forward in developing LLMs capable of handling multi-modal interactions efficiently. The ability to dynamically build and traverse a tool graph represents a flexible approach to task planning and execution, potentially influencing how LLMs can be structured to interact with diverse real-world scenarios.

The paper considers both the theoretical and practical implications of enabling LLMs to utilize external tools, suggesting broader applications in multi-modal dialogue systems and complex task automation. ControlLLM’s architecture, notably the adaptive ToG paradigm, could serve as a foundation for extending the use of LLMs in other domains requiring intricate reasoning and resource management.

In conclusion, the introduction of ControlLLM represents a meaningful advancement in LLM integration with external tools, addressing some of the critical limitations in current methodologies. Future developments may include expanding the toolbox to accommodate more diverse tasks and enhancing the graph’s capabilities for even more complex dependency management. Such advancements could further improve AI’s efficacy in real-world applications, providing more refined interactions between humans and machines.
