Comprehensive Benchmark for API-Augmented LLMs
Introduction to API-BLEND
Recent developments in the field of LLMs have seen a significant shift towards augmenting these models with external Application Programming Interfaces (APIs). This advancement enables LLMs to execute a wider array of tasks by leveraging external tools and databases, extending their applicability beyond text generation alone. Recognizing the need for a robust dataset that supports both the training and systematic benchmarking of such tool-augmented LLMs, Basu et al. introduce API-BLEND. The dataset not only addresses the gap in existing training and evaluation materials but also sets a new standard for assessing API-augmented model performance.
Dataset Overview
API-BLEND is distinct in its comprehensive coverage, spanning both synthetic and real-world scenarios where API invocation is necessary. It comprises 10 datasets, five designated for training and the remaining five for out-of-domain (OOD) evaluation. This blend enriches the dataset with diverse API data and leads to improved OOD generalization, a crucial property given that models trained on it are intended for real-world use across varied domains.
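For illustration, a split along these lines might be organized as in the sketch below. The dataset names are placeholders, not the actual constituent datasets; see the paper for the real list.

```python
# Hypothetical organization of an API-BLEND-style benchmark split.
# Dataset names here are placeholders, not the paper's actual datasets.
API_BLEND_SPLITS = {
    # Five datasets used for training (and in-distribution evaluation).
    "train": ["dataset_a", "dataset_b", "dataset_c", "dataset_d", "dataset_e"],
    # Five held-out datasets used purely for out-of-domain evaluation,
    # so the benchmark measures generalization to unseen APIs and domains.
    "ood_eval": ["dataset_f", "dataset_g", "dataset_h", "dataset_i", "dataset_j"],
}
```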
The datasets within API-BLEND emphasize API/task detection, slot filling, and the sequencing of detected APIs, key capabilities that enable LLMs to complete high-level tasks (see the example sketched below). An innovative aspect of API-BLEND is its inclusion of datasets that focus on sequencing, a relatively underexplored yet critical capability for executing complex tasks.
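To make these three capabilities concrete, here is a hypothetical example of what a sequencing-style instance might look like. The field names, API names, and slot names are illustrative assumptions, not the dataset's actual schema.

```python
# A hypothetical API-BLEND-style instance (field, API, and slot names
# are illustrative assumptions, not the dataset's actual schema).
instance = {
    "input": "Book me a table for two at an Italian place downtown "
             "tonight, and order a taxi to get there.",
    # Target output: an ordered sequence of API calls. The model must
    # (1) detect the right APIs, (2) fill their slots/parameters, and
    # (3) emit the calls in a valid order.
    "output": [
        {
            "api": "RestaurantSearch",
            "slots": {"cuisine": "Italian", "area": "downtown",
                      "party_size": "2", "time": "tonight"},
        },
        {
            "api": "BookTaxi",
            "slots": {"destination": "restaurant", "time": "tonight"},
        },
    ],
}
```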
Technical Insights and Evaluation
Basu et al.'s work meticulously outlines the process of curating the API-BLEND dataset. It employs a hybrid approach, combining human-annotated data with LLM-assisted generation, to encompass over 150,000 instances across training, development, and test sets. This methodology not only ensures richness in the dataset's contextual and API-related diversity but also mitigates common pitfalls of synthetic data generation, such as bias and lack of diversity. Moreover, the paper benchmarks existing LLMs against API-BLEND, revealing substantial improvements in models trained on this dataset. These improvements are quantified through rigorous evaluation metrics, including F1 scores for API and slot/parameter detection and Longest Common Subsequence (LCS) matching for assessing the order of generated API calls (see the sketch below).
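As a rough illustration of how such metrics can be computed, the sketch below implements a set-based F1 over predicted API names and an LCS-based score over API call order. This is a plausible reading of the metrics described in the paper, not the authors' reference implementation.

```python
from typing import List


def f1(predicted: List[str], gold: List[str]) -> float:
    """Set-based F1 between predicted and gold API (or slot) names."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def lcs_score(predicted: List[str], gold: List[str]) -> float:
    """Length of the longest common subsequence of API names,
    normalized by the gold sequence length; rewards correct ordering."""
    m, n = len(predicted), len(gold)
    # Classic O(m*n) dynamic-programming LCS table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if predicted[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / n if n else 0.0


# Example: the model detects both APIs but swaps their order,
# so detection F1 is perfect while the sequence score is penalized.
pred = ["BookTaxi", "RestaurantSearch"]
gold = ["RestaurantSearch", "BookTaxi"]
print(f1(pred, gold))         # 1.0
print(lcs_score(pred, gold))  # 0.5
```

The design choice here is that F1 ignores ordering entirely, so a separate sequence-sensitive score such as LCS is needed to penalize models that call the right APIs in the wrong order.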
Implications and Future Directions
API-BLEND’s introduction marks a significant stride towards enhancing LLMs’ effectiveness in interfacing with external databases and tools, a developing field with widespread practical applications. By providing a robust dataset for training and benchmarking, this paper facilitates further research into optimizing LLMs for API usage, potentially leading to more sophisticated, context-aware, and capable AI systems.
Looking ahead, API-BLEND’s architecture and initial findings lay the groundwork for future explorations. One avenue is investigating how LLMs can be made more efficient in real-time API invocations within dynamic environments, a scenario that the current dataset begins to address. Additionally, expanding API-BLEND to include more natural language variations and multilingual support could further enhance LLMs' global applicability and utility.
In conclusion, API-BLEND represents a pivotal development in the field of tool-augmented LLMs. By providing a detailed framework for dataset curation and a comprehensive benchmarking methodology, this paper underscores the importance of API augmentation in advancing LLM capabilities. Future research, guided by the insights and resources provided by API-BLEND, is poised to unlock new frontiers in AI's practical application and efficiency in task execution.