NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (2409.03797v1)
Abstract: Autonomous agent applications powered by LLMs have recently risen to prominence as effective tools for addressing complex real-world tasks. At their core, agentic workflows rely on LLMs to plan and execute the use of tools and external Application Programming Interfaces (APIs) in sequence to arrive at the answer to a user's request. Various benchmarks and leaderboards have emerged to evaluate an LLM's capabilities for tool and API use; however, most of these evaluations only track single API calls or multiple independent, non-nested calls. In this paper, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL comprises 300 human-annotated samples divided into two types: executable and non-executable. The executable samples are curated manually by crawling RapidAPI, whereas the non-executable samples are hand-picked by human annotators from data synthetically generated with an LLM. We evaluate state-of-the-art LLMs with function-calling abilities on NESTFUL. Our results show that most models perform poorly on the nested sequences in NESTFUL compared to their performance on the simpler problem settings covered by existing benchmarks.
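To make the nested-call setting concrete, below is a minimal sketch of what such a sequence and its execution might look like. The API names (`get_coordinates`, `get_weather`), argument names, and the `$varN` labeling convention are illustrative assumptions for this sketch, not the benchmark's exact schema.

```python
# A nested API-call sequence: the second call consumes the first call's output.
# Format and field names here are assumptions for illustration only.
nested_call_sequence = [
    {
        "name": "get_coordinates",           # hypothetical API
        "arguments": {"city": "Boston"},
        "label": "$var1",                    # name under which the output is stored
    },
    {
        "name": "get_weather",               # hypothetical API
        "arguments": {
            "lat": "$var1.latitude",         # references the previous call's output
            "lon": "$var1.longitude",
        },
        "label": "$var2",
    },
]

def _resolve(value, outputs):
    """Replace a "$varN.field" reference with the stored output's field."""
    if isinstance(value, str) and value.startswith("$"):
        var, _, field = value.partition(".")
        result = outputs[var]
        return result[field] if field else result
    return value

def execute(sequence, registry):
    """Run each call in order, resolving $varN references as we go."""
    outputs = {}
    for call in sequence:
        args = {k: _resolve(v, outputs) for k, v in call["arguments"].items()}
        outputs[call["label"]] = registry[call["name"]](**args)
    return outputs

# Example usage with stub implementations of the two hypothetical APIs:
registry = {
    "get_coordinates": lambda city: {"latitude": 42.36, "longitude": -71.06},
    "get_weather": lambda lat, lon: {"temp_c": 7.0},
}
print(execute(nested_call_sequence, registry))
```

Evaluating a model on such samples requires it to emit the whole sequence with correct data flow between calls, which is what distinguishes this setting from isolated single- or multi-call benchmarks.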
Authors: Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, Maxwell Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis Lastras, Pavan Kapanipathi