Evaluation of LLMs Using Simulated Trial and Error for Tool Learning
Introduction
The utility of LLMs can be significantly enhanced by augmenting them with tools, enabling these models to interact with the external world through APIs to acquire fresh information or perform actions. While prior research has primarily focused on the breadth of tool integration and the ease of introducing novel tools into LLMs, how accurately these models actually use tools has received far less attention. This paper posits that current LLMs, including state-of-the-art models like GPT-4 and models specifically tuned for tool use, achieve a correctness rate of only 30% to 60%, revealing a substantial reliability gap for practical applications.
Methodology: Simulated Trial and Error (STE)
To address this gap, the paper introduces simulated trial and error (STE), a methodology inspired by the biological mechanisms underpinning tool use in humans and other animals. STE orchestrates three integral processes, trial and error, imagination, and memory (both short-term and long-term), to strengthen an LLM's ability to learn and use tools. In the exploration phase, the LLM leverages its generative capacity to imagine plausible tool-use scenarios, executes tool calls based on those scenarios, and adapts subsequent attempts in light of the execution feedback. Short-term memory of recent trials guides deeper exploration within an episode, while long-term memory of cumulative experiences encourages breadth of exploration across episodes. In the exploitation phase, the experiences aggregated during exploration are distilled into improved tool-use proficiency, either through in-context learning (ICL) or fine-tuning.
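To make the exploration loop concrete, here is a minimal Python sketch of one STE episode under stated assumptions: the helpers `llm` (any chat-completion wrapper) and `execute_api` (runs a tool call and returns its output) are hypothetical stubs, and the prompt wording and episode structure are illustrative rather than the paper's exact design.

```python
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion wrapper here")

def execute_api(call: str) -> str:
    raise NotImplementedError("plug in real tool execution here")

def explore_tool(tool_doc: str, long_term_memory: list[str], num_trials: int = 3) -> list[dict]:
    """Run one exploration episode for a single tool and return its trials."""
    short_term_memory: list[dict] = []          # recent trials within this episode
    past_queries = "\n".join(long_term_memory)  # distilled record of earlier episodes

    for _ in range(num_trials):
        # Imagination: synthesize a plausible user query, steering away from
        # queries already explored (long-term memory encourages breadth).
        query = llm(
            f"Tool spec:\n{tool_doc}\n"
            f"Previously explored queries:\n{past_queries}\n"
            "Imagine one new, realistic user query this tool could answer."
        )

        # Trial: attempt an API call for the imagined query, conditioned on
        # this episode's recent attempts (short-term memory guides depth).
        recent = "\n".join(str(t) for t in short_term_memory)
        api_call = llm(
            f"Tool spec:\n{tool_doc}\nRecent trials:\n{recent}\n"
            f"User query: {query}\nWrite the API call."
        )
        result = execute_api(api_call)

        # Error feedback: self-judge the outcome so the next trial can adapt.
        verdict = llm(
            f"Query: {query}\nCall: {api_call}\nResult: {result}\n"
            "Did the call fulfill the query? Answer yes/no with a reason."
        )
        short_term_memory.append(
            {"query": query, "call": api_call, "result": result, "verdict": verdict}
        )

    # Fold this episode's queries into long-term memory for future breadth.
    long_term_memory.extend(t["query"] for t in short_term_memory)
    return short_term_memory
```

The recorded trials can then feed the exploitation phase, either selected as ICL demonstrations or converted into fine-tuning examples.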
Experimental Findings
Comprehensive testing on APIs from ToolBench demonstrated that STE substantially improves the tool-learning proficiency of LLMs. Applying STE in both ICL and fine-tuning settings led to pronounced gains across several LLMs, notably boosting Mistral-Instruct-7B's correctness by 46.7% and enabling it to surpass GPT-4. Ablation studies further validated the significance of each STE component, with the removal of any one leading to noticeable degradation in tool-use effectiveness. Additionally, the paper explores a simple strategy for continual tool learning, showing that rehearsal via experience replay can mitigate catastrophic forgetting, maintaining the LLM's proficiency across a growing toolset without sacrificing previously acquired tool skills or general language ability.
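The following sketch illustrates one way such rehearsal-style experience replay could be set up for continual fine-tuning; the function name, the replay ratio, and the sampling scheme are illustrative assumptions, not the paper's exact recipe, and examples are assumed to be (prompt, target) pairs.

```python
import random

def build_finetuning_mix(new_examples, replay_buffer, replay_ratio=0.3, seed=0):
    """Mix new-tool examples with a sample of earlier ones.

    replay_buffer holds examples from previously learned tool batches;
    rehearsing a slice of them alongside the new batch counteracts
    catastrophic forgetting of earlier tools.
    """
    rng = random.Random(seed)
    n_replay = min(len(replay_buffer), int(len(new_examples) * replay_ratio))
    mix = list(new_examples) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(mix)  # interleave old and new so batches stay mixed
    return mix

# After fine-tuning on the mix, fold the new examples into the buffer so
# the next round of tools can rehearse them too:
#   replay_buffer.extend(new_examples)
```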
Implications and Future Directions
The findings underscore the importance of accurate and reliable tool use in LLM applications, presenting STE as a potent methodology for significantly enhancing LLM capabilities in this domain. The advances point toward more dynamic, interactive modes of LLM training that more closely mirror how humans learn through engagement with the environment. Looking forward, promising avenues include scaling STE, extending its principles to compositional tool use and planning, and refining continual learning strategies. Furthermore, richer environmental interactions and the incorporation of explicit user feedback could add realism and adaptability to how LLMs learn tools.
Conclusion
This work highlights a critical oversight in the current paradigm of tool-augmented LLM development, presenting simulated trial and error as a viable and effective solution to this challenge. By explicitly addressing and enhancing the accuracy and reliability of tool use in LLMs, STE offers a path toward more robust, competent, and versatile AI systems capable of meaningful interaction with the external world.