Introduction
AnyTool is an LLM agent that draws on over 16,000 APIs to address user queries without training any external modules. It integrates a hierarchical API retriever, a solver, and a self-reflection mechanism, which together form a closed-loop system for resolving queries. In benchmark evaluations, AnyTool achieves a higher average pass rate than existing approaches.
Hierarchical API Retriever
At the core of AnyTool is an API retriever that uses a hierarchical structure to search a large collection of APIs efficiently. Inspired by the divide-and-conquer strategy, the retriever comprises a meta-agent, category agents, and tool agents, which successively narrow the search space by following the API structure defined by Rapid API. Because each agent handles only part of the API collection, this structure mitigates the constraint imposed by the maximum context length of LLMs. Across all datasets, pass rates rise with the number of self-reflection rounds, with improvements of up to 20% after 4-6 rounds.
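The narrowing described above can be illustrated with a minimal sketch. It is not AnyTool's implementation: the `Category` and `Tool` classes, the toy `CATALOG`, and the keyword-overlap function `is_relevant` are all hypothetical stand-ins, and in the real system each relevance decision would be made by an LLM agent rather than by word matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    apis: list[str]


@dataclass
class Category:
    name: str
    tools: list[Tool] = field(default_factory=list)


def is_relevant(query: str, label: str) -> bool:
    """Hypothetical stand-in for an agent's judgment: any shared word."""
    query_words = set(query.lower().split())
    label_words = label.lower().replace("_", " ").split()
    return any(word in query_words for word in label_words)


def retrieve_apis(query: str, categories: list[Category]) -> list[str]:
    """Narrow the search level by level: category, then tool, then API."""
    candidates = []
    for category in categories:  # meta-agent level: pick categories
        if not is_relevant(query, category.name):
            continue  # the whole subtree is skipped, never inspected
        for tool in category.tools:  # category-agent level: pick tools
            if not is_relevant(query, tool.name):
                continue
            for api in tool.apis:  # tool-agent level: pick APIs
                if is_relevant(query, api):
                    candidates.append(f"{category.name}/{tool.name}/{api}")
    return candidates


# Toy catalog; the names are invented for illustration only.
CATALOG = [
    Category("weather", [Tool("forecast", ["daily_forecast", "hourly_forecast"])]),
    Category("finance", [Tool("stocks", ["stock_price"])]),
]

weather_hits = retrieve_apis("daily weather forecast for Paris", CATALOG)
```

The point of the sketch is that a rejected category prunes every tool and API beneath it, so no single decision has to consider the full collection at once.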
Self-Reflection Mechanism
In addition to the hierarchical retriever, AnyTool features a self-reflection mechanism that is activated when the initial solution fails. It lets AnyTool take the reasons for failure and the earlier context into account, which refines the search strategy and reduces the tendency to "over-search". Self-reflection is applied to both the API retriever and the solver, refining each over successive rounds to improve overall performance.
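The closed loop between retriever, solver, and self-reflection might be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `solve_with_reflection`, the `retrieve` and `solve` callables, the format of the failure notes, and the default of four rounds are all hypothetical choices made for the example.

```python
from typing import Callable, Optional


def solve_with_reflection(
    query: str,
    retrieve: Callable[[str, list], list],
    solve: Callable[[str, list], Optional[str]],
    max_rounds: int = 4,  # assumed default, chosen for illustration
) -> tuple:
    """Retry retrieval and solving, feeding failure notes back each round."""
    failure_notes = []  # context carried from one round to the next
    for round_index in range(1, max_rounds + 1):
        # The retriever sees earlier failures and can adjust its search.
        candidates = retrieve(query, failure_notes)
        answer = solve(query, candidates)
        if answer is not None:
            return answer, round_index
        # Record why this round failed before trying again.
        failure_notes.append(
            f"round {round_index}: no solution with {candidates}"
        )
    return None, max_rounds
```

Passing the accumulated notes back into `retrieve` is what makes the loop reflective rather than a plain retry: each round can search differently instead of repeating the same failed attempt.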
Evaluation Protocol & Benchmarks
AnyTool also revises the evaluation protocol for user query resolution, addressing a critical flaw in earlier methodology in which misclassified "non-solvable" queries produced an artificially high pass rate. The authors introduce AnyToolBench, a new supplementary benchmark, and manually review queries to ensure that each can be solved with specific APIs. Under this protocol, AnyTool outperforms strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization, by +35.4% in average pass rate on ToolBench.
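A small worked example may help show why the treatment of "non-solvable" queries matters. The sketch below assumes, purely for illustration, that the inflated protocol counts every query labelled non-solvable as a pass, while the revised protocol scores only queries confirmed solvable. The record fields `solvable` and `passed` and the function `pass_rate` are hypothetical, and the numbers are made up rather than taken from any benchmark.

```python
def pass_rate(records: list, count_non_solvable_as_pass: bool) -> float:
    """Compute a pass rate under either of the two assumed protocols."""
    if count_non_solvable_as_pass:
        # Assumed inflated protocol: a non-solvable label counts as a pass.
        scored = records
        passed = sum(1 for r in scored if r["passed"] or not r["solvable"])
    else:
        # Assumed revised protocol: only confirmed-solvable queries count.
        scored = [r for r in records if r["solvable"]]
        passed = sum(1 for r in scored if r["passed"])
    return passed / len(scored) if scored else 0.0


# Made-up records: two solvable queries (one passed), two non-solvable.
records = [
    {"solvable": True, "passed": True},
    {"solvable": True, "passed": False},
    {"solvable": False, "passed": False},
    {"solvable": False, "passed": False},
]

inflated = pass_rate(records, count_non_solvable_as_pass=True)   # 3 of 4
revised = pass_rate(records, count_non_solvable_as_pass=False)   # 1 of 2
```

With these toy records the same system scores 0.75 under the first rule and 0.5 under the second, which is the kind of gap a manual solvability review is meant to remove.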
Conclusion
AnyTool advances tool utilization in LLMs, providing an agent that efficiently draws on thousands of APIs to address complex user queries. Its hierarchical structure and self-reflection mechanism not only simplify the retrieval process but also significantly enhance the problem-solving abilities of LLMs. These gains are supported by its benchmark results, and the availability of AnyTool's code gives the research community a valuable resource for exploring and extending the approach.