Evaluation of LLMs Using Simulated Trial and Error for Tool Learning
Introduction
The utility of LLMs can be significantly enhanced by augmenting them with tools, enabling these models to interact with the external world through APIs to acquire fresh information or perform actions. While prior research has primarily focused on the breadth of tool integration and the ease of introducing novel tools into LLMs, how accurately these models actually use tools has received far less attention. This paper posits that current LLMs, including state-of-the-art models like GPT-4 and models specifically tuned for tool use, achieve a correctness rate of only 30% to 60%, revealing a substantial reliability gap for practical applications.
Methodology: Simulated Trial and Error (STE)
To address this gap, the paper introduces simulated trial and error (STE), a methodology inspired by the biological mechanisms underpinning tool use in humans and other animals. STE orchestrates three integral processes, trial and error, imagination, and memory (both short-term and long-term), to strengthen an LLM's ability to learn and use tools. In the exploration phase, the LLM leverages its generative capacity to imagine plausible tool-use scenarios, executes tool calls based on those scenarios, and adapts subsequent attempts in light of the execution feedback. Short-term memory of recent trials guides deeper exploration within an episode, while long-term memory of cumulative experiences encourages breadth of exploration across episodes. In the exploitation phase, the experiences aggregated during exploration are distilled into improved tool-use proficiency, either through in-context learning (ICL) or fine-tuning.
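To make the exploration loop concrete, here is a minimal Python sketch of one STE episode under stated assumptions: the helpers `llm` (any chat-completion wrapper) and `execute_api` (runs a tool call and returns its output) are hypothetical stubs, and the prompt wording and episode structure are illustrative rather than the paper's exact design.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion wrapper here")

def execute_api(call: str) -> str:
    raise NotImplementedError("plug in real tool execution here")

def explore_tool(tool_doc: str, long_term_memory: list[str], num_trials: int = 3) -> list[dict]:
    """Run one exploration episode for a single tool and return its trials."""
    short_term_memory: list[dict] = []          # recent trials within this episode
    past_queries = "\n".join(long_term_memory)  # distilled record of earlier episodes

    for _ in range(num_trials):
        # Imagination: synthesize a plausible user query, steering away from
        # queries already explored (long-term memory encourages breadth).
        query = llm(
            f"Tool spec:\n{tool_doc}\n"
            f"Previously explored queries:\n{past_queries}\n"
            "Imagine one new, realistic user query this tool could answer."
        )

        # Trial: attempt an API call for the imagined query, conditioned on
        # this episode's recent attempts (short-term memory guides depth).
        recent = "\n".join(str(t) for t in short_term_memory)
        api_call = llm(
            f"Tool spec:\n{tool_doc}\nRecent trials:\n{recent}\n"
            f"User query: {query}\nWrite the API call."
        )
        result = execute_api(api_call)

        # Error feedback: self-judge the outcome so the next trial can adapt.
        verdict = llm(
            f"Query: {query}\nCall: {api_call}\nResult: {result}\n"
            "Did the call fulfill the query? Answer yes/no with a reason."
        )
        short_term_memory.append(
            {"query": query, "call": api_call, "result": result, "verdict": verdict}
        )

    # Fold this episode's queries into long-term memory for future breadth.
    long_term_memory.extend(t["query"] for t in short_term_memory)
    return short_term_memory
```

The recorded trials can then feed the exploitation phase, either selected as ICL demonstrations or converted into fine-tuning examples.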
Experimental Findings
Comprehensive testing on APIs from ToolBench demonstrated that STE substantially improves the tool-learning proficiency of LLMs. Applying STE in both ICL and fine-tuning settings led to pronounced gains across several LLMs, notably boosting Mistral-Instruct-7B's correctness by 46.7% and enabling it to surpass GPT-4. Ablation studies further validated the significance of each STE component, with the removal of any one leading to noticeable degradation in tool-use effectiveness. Additionally, the paper explores a simple strategy for continual tool learning, showing that rehearsal via experience replay can mitigate catastrophic forgetting, maintaining the LLM's proficiency across a growing toolset without sacrificing previously acquired tool skills or general language ability.
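The following sketch illustrates one way such rehearsal-style experience replay could be set up for continual fine-tuning; the function name, the replay ratio, and the sampling scheme are illustrative assumptions, not the paper's exact recipe, and examples are assumed to be (prompt, target) pairs.

```python
import random

def build_finetuning_mix(new_examples, replay_buffer, replay_ratio=0.3, seed=0):
    """Mix new-tool examples with a sample of earlier ones.

    replay_buffer holds examples from previously learned tool batches;
    rehearsing a slice of them alongside the new batch counteracts
    catastrophic forgetting of earlier tools.
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), int(len(new_examples) * replay_ratio))
    mix = list(new_examples) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(mix)  # interleave old and new so batches stay mixed
    return mix

# After fine-tuning on the mix, fold the new examples into the buffer so
# the next round of tools can rehearse them too:
#   replay_buffer.extend(new_examples)
```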
Implications and Future Directions
The findings underscore the importance of accurate and reliable tool use in LLM applications, presenting STE as a potent methodology for significantly enhancing LLM capabilities in this domain. The advances point toward more dynamic, interactive modes of LLM training that more closely mirror how humans learn through engagement with the environment. Looking forward, promising avenues include scaling STE, extending its principles to compositional tool use and planning, and refining continual learning strategies. Furthermore, richer environmental interactions and the incorporation of explicit user feedback could add realism and adaptability to how LLMs learn tools.
Conclusion
This work highlights a critical oversight in the current paradigm of tool-augmented LLM development, presenting simulated trial and error as a viable and effective solution to this challenge. By explicitly addressing and enhancing the accuracy and reliability of tool use in LLMs, STE offers a path toward more robust, competent, and versatile AI systems capable of meaningful interaction with the external world.