ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios (2401.00741v3)
Abstract: Existing evaluations of tool learning primarily focus on validating whether the tools selected by LLMs align with expected outcomes. However, these approaches rely on a limited set of scenarios whose answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the complex capabilities LLMs need to use tools effectively. To tackle this issue, we propose ToolEyes, a fine-grained system tailored to evaluating the tool learning capabilities of LLMs in authentic scenarios. The system examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library of roughly 600 tools that serves as an intermediary between LLMs and the physical world. Evaluations of ten LLMs across three categories reveal a preference for specific scenarios and limited cognitive abilities in tool learning. Intriguingly, increasing model size can even exacerbate these deficiencies in tool learning. The code and data are available at https://github.com/Junjie-Ye/ToolEyes.
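For intuition, here is a minimal sketch of what scoring a single tool-use trajectory along the five named dimensions could look like. Everything in it (the `Trajectory` container, the placeholder scorers, the example tool call) is a hypothetical illustration under assumed interfaces, not the actual ToolEyes implementation.

```python
# Hypothetical sketch of a five-dimension tool-learning evaluation loop.
# Names (Trajectory, score_dimension, get_forecast) are illustrative only.
from dataclasses import dataclass, field

DIMENSIONS = [
    "format_alignment",      # does the model emit well-formed tool calls?
    "intent_comprehension",  # does it grasp what the user actually needs?
    "behavior_planning",     # does it sequence tool calls sensibly?
    "tool_selection",        # does it pick suitable tools from the library?
    "answer_organization",   # does it compose results into a coherent answer?
]

@dataclass
class Trajectory:
    """One model run: the query, the tool calls it made, and its final answer."""
    query: str
    tool_calls: list = field(default_factory=list)  # [(tool_name, args, result), ...]
    answer: str = ""

def score_dimension(dim: str, traj: Trajectory) -> float:
    """Placeholder scorer returning a value in [0, 1].

    Only trivially observable signals are checked here; the real system
    would need a task-specific rubric or judge per dimension.
    """
    if dim == "format_alignment":
        # e.g., every call must at least pass its arguments as a dict
        return float(all(isinstance(args, dict) for _, args, _ in traj.tool_calls))
    if dim == "answer_organization":
        return float(bool(traj.answer.strip()))
    return 0.0  # remaining dimensions left to task-specific judges

def evaluate(traj: Trajectory) -> dict:
    """Score a trajectory on every dimension and report per-dimension results."""
    return {dim: score_dimension(dim, traj) for dim in DIMENSIONS}

if __name__ == "__main__":
    traj = Trajectory(
        query="What's the weather in Paris tomorrow?",
        tool_calls=[("get_forecast", {"city": "Paris", "days": 1}, "18°C, light rain")],
        answer="Tomorrow in Paris: around 18°C with light rain.",
    )
    print(evaluate(traj))
```

In a full system, the non-trivial dimensions (intent comprehension, behavior planning, tool selection) would presumably be judged by rubric-based or model-based scorers rather than the trivial checks above; per-dimension reporting is what makes the evaluation fine-grained rather than outcome-only.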
Authors: Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan, Shihan Dou, Qi Zhang, Tao Gui, Xuanjing Huang, Tao Ji