
Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (2402.14672v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: The applications of LLMs have expanded well beyond the confines of text processing, signaling a new era where LLMs are envisioned as generalist agents capable of operating within complex environments. These environments are often highly expansive, making it impossible for the LLM to process them within its short-term memory. Motivated by recent research on extending the capabilities of LLMs with tools, we seek to investigate the intriguing potential of tools to augment LLMs in handling such complexity by introducing a novel class of tools, termed middleware, to aid in the proactive exploration within these massive environments. Such specialized tools can serve as a middleware layer shielding the LLM from environmental complexity. In two representative complex environments -- knowledge bases (KBs) and databases -- we demonstrate the significant potential of augmenting language agents with tools in complex environments. Notably, equipped with the middleware, GPT-4 achieves 2.8X the performance of the best baseline in tasks requiring access to database content and 2.2X in KB tasks. Our findings illuminate the path for advancing language agents in real-world applications.

Middleware for LLMs: Enhancing Language Agent Performance in Complex Environments through Customized Tools

Introduction

The expanding applications of LLMs have extended far beyond mere text processing, indicating an era where these models are envisioned as versatile language agents that can support a broad spectrum of complex real-world tasks. This paper explores the application of customized tools as a middleware layer that enables LLMs, specifically GPT-4, to significantly surpass performance baselines in navigating and executing tasks within complex databases and knowledge bases (KBs). Notably, we demonstrate a 2.8× improvement over the best baseline for database-related tasks and a 2.2× improvement for KB tasks.

Custom Tools: Bridging LLMs and Complex Environments

The core of our framework, named Fuxi, is a comprehensive suite of tools that enables GPT-4 to interact proficiently with databases and knowledge bases. These tools are designed to replicate human-like information-seeking behavior when executing complex tasks in such environments. They span navigational aids for exploring the environment and functional aids for specific operations, such as composing SQL queries over databases and performing multi-hop reasoning over KBs. This design lets LLMs bypass the inherent limits of their short-term memory in expansive or intricate environments by proactively fetching and processing relevant information as needed.
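To make the navigational/functional distinction concrete, here is a minimal sketch of what database middleware of this kind might look like. The class and method names (`DatabaseMiddleware`, `get_table_names`, `get_columns`, `run_sql`) are illustrative assumptions, not the paper's actual tool interface; the key idea is that each tool returns a small, context-sized slice of the environment rather than the whole schema or result set.

```python
import sqlite3

class DatabaseMiddleware:
    """Hypothetical middleware sketch: expose a database to an LLM agent
    piece by piece instead of dumping the full schema into its context."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def get_table_names(self):
        # Navigational aid: list tables so the agent can choose one to inspect.
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
        return [r[0] for r in rows]

    def get_columns(self, table):
        # Navigational aid: reveal a single table's columns on demand.
        return [row[1] for row in self.conn.execute(f"PRAGMA table_info({table})")]

    def run_sql(self, query, limit=5):
        # Functional aid: execute a query but truncate the result
        # so the observation fits in the agent's context window.
        return self.conn.execute(query).fetchmany(limit)
```

An agent would call `get_table_names` first, drill into a promising table with `get_columns`, and only then compose and run a query, mirroring how a human analyst explores an unfamiliar database.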

Methodology and Evaluation

Our methodology pairs the crafted tools with the ReAct reasoning algorithm, which interleaves reasoning steps with tool invocations so that LLMs can use the tools effectively. Across evaluations of six different LLMs on curated benchmarks of demanding tasks, Fuxi consistently outperformed existing baselines, substantially improving the models' ability to interact with and execute complex tasks in both databases and KBs. For database environments, our evaluation used the Bird dataset, notable for its complexity; for KBs, we introduced a newly compiled benchmark, KBQA-Agent, to assess performance on intricate questions requiring deep engagement with the KB.
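The interleaving of reasoning and tool calls can be sketched as a small control loop. This is a generic ReAct-style skeleton under assumed conventions (the `Action: name[arg]` / `Final Answer:` step format and the `react_loop` signature are illustrative), not the paper's exact prompt protocol; `llm` is a stand-in callable that returns the model's next step as text.

```python
def react_loop(llm, tools, question, max_steps=10):
    """Minimal ReAct-style loop: the model alternates reasoning and acting,
    and each Action invokes a middleware tool whose result is fed back
    into the transcript as an Observation."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)          # e.g. "Action: get_table_names[]"
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step[len("Final Answer:"):].strip()
        if step.startswith("Action:"):
            # Parse "Action: tool_name[argument]" and call the matching tool.
            name, _, arg = step[len("Action:"):].strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return None  # give up after max_steps without a final answer
```

The transcript grows only with what the tools return, which is what lets the agent work in environments far larger than its context window.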

Insights and Implications

The substantial improvements observed with Fuxi underscore both the potential and the necessity of tool augmentation for LLMs handling complex real-world applications. The paper not only sets a new state of the art for LLM performance in these intricate environments but also opens pathways for further research into integrating LLMs with a wider variety of complex applications.

Our analysis also shows that, despite these advances, there is considerable room for improvement, especially in environments without straightforward query interfaces. Furthermore, because the tools were designed primarily from intuition and experience, the results point to the need for a more structured approach to tool development to harness even greater performance gains.

Future Prospects

Looking ahead, embedding LLMs in an even broader range of complex environments is a promising avenue. Refining the tool development process with a more principled strategy could further enhance the efficacy of LLMs as generalist language agents. As the boundaries of what LLMs can achieve continue to expand, customized tools will play a pivotal role in transforming these models into more potent and versatile agents for real-world problem-solving.

Acknowledgements and Support

The efforts leading to these advancements were supported by collaborative insights from the THU KEG and OSU NLP groups, alongside practical aid from external partners including Cisco Research. This collective endeavor underlines the importance of communal effort in driving forward the boundaries of AI research and its applications.

References (47)
  1. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics.
  2. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1247–1250. ACM.
  3. KQA Pro: A dataset with explicit compositional programs for complex question answering over knowledge base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 6101–6119. Association for Computational Linguistics.
  4. Grounding ’grounding’ in NLP. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, pages 4283–4305. Association for Computational Linguistics.
  5. Teaching large language models to self-debug. CoRR, abs/2304.05128.
  6. Mind2web: Towards a generalist agent for the web. CoRR, abs/2306.06070.
  7. CRITIC: large language models can self-correct with tool-interactive critiquing. CoRR, abs/2305.11738.
  8. Don’t generate, discriminate: A proposal for grounding language models to real-world environments. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 4928–4949. Association for Computational Linguistics.
  9. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases. In WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 3477–3488. ACM / IW3C2.
  10. Knowledge base question answering: A semantic parsing perspective. In 4th Conference on Automated Knowledge Base Construction.
  11. Yu Gu and Yu Su. 2022. ArcaneQA: Dynamic program induction and contextualized encoding for knowledge base question answering. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1718–1731, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  12. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. CoRR, abs/2305.14909.
  13. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. CoRR, abs/2305.11554.
  14. A comprehensive exploration on wikisql with table-aware word contextualization. CoRR, abs/1902.01069.
  15. Mistral 7b. CoRR, abs/2310.06825.
  16. Mixtral of experts. CoRR.
  17. StructGPT: A general framework for large language model to reason over structured data. CoRR, abs/2305.09645.
  18. Can LLM already serve as A database interface? A big bench for large-scale database grounded text-to-sqls. CoRR, abs/2305.03111.
  19. API-Bank: A benchmark for tool-augmented llms. CoRR, abs/2304.08244.
  20. Few-shot in-context learning on knowledge base question answering. In Annual Meeting of the Association for Computational Linguistics.
  21. AgentBench: Evaluating llms as agents. CoRR, abs/2308.03688.
  22. Chameleon: Plug-and-play compositional reasoning with large language models. CoRR, abs/2304.09842.
  23. Augmented language models: a survey. CoRR, abs/2302.07842.
  24. Code-style in-context learning for knowledge-based question answering. CoRR, abs/2309.04695.
  25. OpenAI. 2023a. GPT-4 technical report. CoRR, abs/2303.08774.
  26. OpenAI. 2023b. Models - OpenAI API. https://platform.openai.com/docs/models/gpt-3-5.
  27. Tool learning with foundation models. CoRR, abs/2304.08354.
  28. ToolLLM: Facilitating large language models to master 16000+ real-world apis. CoRR, abs/2307.16789.
  29. Evaluating the text-to-sql capabilities of large language models. CoRR, abs/2204.00498.
  30. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  31. Toolformer: Language models can teach themselves to use tools. CoRR, abs/2302.04761.
  32. Alfworld: Aligning text and embodied environments for interactive learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  33. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  34. Yu Su. 2023. Language agents: a critical evolutionary step of artificial intelligence. yusu.substack.com.
  35. On generating characteristic-rich question sets for QA evaluation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 562–572. The Association for Computational Linguistics.
  36. Battle of the large language models: Dolly vs LLaMA vs vicuna vs guanaco vs bard vs ChatGPT - a text-to-SQL parsing comparison. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11225–11238, Singapore. Association for Computational Linguistics.
  37. Exploring chain of thought style prompting for text-to-sql. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 5376–5393. Association for Computational Linguistics.
  38. Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 641–651. Association for Computational Linguistics.
  39. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  40. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
  41. ReAct: Synergizing reasoning and acting in language models. CoRR, abs/2210.03629.
  42. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 201–206, Berlin, Germany. Association for Computational Linguistics.
  43. DecAF: Joint decoding of answers and logical forms for question answering over knowledge bases. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  44. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3911–3921. Association for Computational Linguistics.
  45. Variational reasoning for question answering with knowledge graph. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 6069–6076. AAAI Press.
  46. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.
  47. Seq2SQL: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.
Authors (9)
  1. Yu Gu (218 papers)
  2. Yiheng Shu (9 papers)
  3. Hao Yu (195 papers)
  4. Xiao Liu (402 papers)
  5. Yuxiao Dong (119 papers)
  6. Jie Tang (302 papers)
  7. Jayanth Srinivasa (23 papers)
  8. Hugo Latapie (28 papers)
  9. Yu Su (138 papers)
Citations (18)