Towards Completeness-Oriented Tool Retrieval for Large Language Models (2405.16089v2)
Abstract: Integrating external tools with LLMs has recently gained significant attention as an effective strategy for mitigating the limitations inherent in their pre-training data. However, real-world systems often incorporate a wide array of tools, making it impractical to feed all of them to an LLM due to input-length and latency constraints. To fully exploit the potential of tool-augmented LLMs, it is therefore crucial to develop an effective tool retrieval system. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions, which frequently leads to the retrieval of redundant, near-identical tools. Consequently, these methods fail to provide the complete set of diverse tools needed to address the multifaceted problems LLMs encounter. In this paper, we propose a novel model-agnostic COllaborative Learning-based Tool retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also the collaborative information among tools. Specifically, in the semantic learning stage, we first fine-tune PLM-based retrieval models to capture the semantic relationships between queries and tools. In the subsequent collaborative learning stage, we construct three bipartite graphs among queries, scenes, and tools and introduce a dual-view graph collaborative learning framework to capture the intricate collaborative relationships among tools. Extensive experiments on both an open benchmark and the newly introduced ToolLens dataset show that COLT achieves superior performance. Notably, BERT-mini (11M) with our proposed framework outperforms BERT-large (340M), which has 30 times more parameters. Furthermore, we will release ToolLens publicly to facilitate future research on tool retrieval.
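To make the two-stage idea concrete, here is a minimal, self-contained sketch of how semantic embeddings could be refined with collaborative signal from a query–tool bipartite graph, using LightGCN-style propagation (He et al., 2020). This is an illustrative assumption, not the authors' implementation: the paper's full method additionally involves scene nodes and a dual-view framework, which this toy omits, and all sizes and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical sizes for illustration only.
NUM_QUERIES, NUM_TOOLS, DIM = 4, 6, 8
rng = np.random.default_rng(0)

# Stage 1 (semantic learning): assume a fine-tuned PLM bi-encoder has
# already produced dense embeddings for queries and tool descriptions;
# random vectors stand in for them here.
query_emb = rng.normal(size=(NUM_QUERIES, DIM))
tool_emb = rng.normal(size=(NUM_TOOLS, DIM))

# Stage 2 (collaborative learning): propagate embeddings over a
# query-tool bipartite graph so each tool also absorbs signal from the
# queries (and hence the co-used tools) it is connected to.
# Interaction matrix R[i, j] = 1 if tool j belongs to the ground-truth
# tool set of query i (random here for the sketch).
R = (rng.random((NUM_QUERIES, NUM_TOOLS)) < 0.3).astype(float)

def normalized_adjacency(R):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of the bipartite graph."""
    n_q, n_t = R.shape
    A = np.zeros((n_q + n_t, n_q + n_t))
    A[:n_q, n_q:] = R
    A[n_q:, :n_q] = R.T
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

A_hat = normalized_adjacency(R)
E = np.vstack([query_emb, tool_emb])
layers = [E]
for _ in range(3):                 # 3 LightGCN-style propagation layers
    layers.append(A_hat @ layers[-1])
E_final = np.mean(layers, axis=0)  # average embeddings across layers

q_final, t_final = E_final[:NUM_QUERIES], E_final[NUM_QUERIES:]
scores = q_final @ t_final.T       # query-tool retrieval scores
top_k = np.argsort(-scores, axis=1)[:, :3]
print("top-3 tools per query:\n", top_k)
```

Because propagation mixes in graph structure on top of semantic similarity, tools that co-occur with a query's ground-truth set can be pulled closer to that query even when their descriptions alone would not rank highly, which is the intuition behind retrieving a complete rather than merely similar set of tools.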
Authors: Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, Ji-Rong Wen