API-Bank Benchmark Overview

  • API-Bank Benchmark is a comprehensive evaluation suite that measures and enhances LLM tool-use capabilities by testing API calling, retrieval, and planning in real-world contexts.
  • It employs a structured multi-level evaluation—Call, Retrieve+Call, and Plan+Retrieve+Call—using both real and synthetic datasets with detailed annotation pipelines.
  • The benchmark identifies key challenges such as API hallucination and parameter extraction errors, guiding improvements in dynamic API retrieval and constrained decoding techniques.

API-Bank Benchmark is a comprehensive evaluation suite and data resource for measuring, training, and diagnosing the tool-use capabilities of LLMs, particularly those augmented to interact with external application programming interfaces (APIs). Originating with the work of Li et al. (2023), API-Bank sets out to answer foundational questions on the effectiveness, reliability, and improvement of LLMs in real-world tool usage, establishing a standard for both model evaluation and data quality in this emerging area (Li et al., 2023).

1. Problem Definition and Task Spectrum

API-Bank operationalizes tool-augmented LLM performance across three increasingly difficult “ability levels”: Call, Retrieve+Call, and Plan+Retrieve+Call. Each ability level is defined as follows (Li et al., 2023):

  • Call: Given a dialogue history where all APIs are made available, the model is tasked with selecting and invoking the correct endpoint(s) based on the current user prompt.
  • Retrieve+Call: The model is not told upfront which APIs are available. For each turn, it must extract intent from the user's request, retrieve the most relevant API via a retrieval module (using cosine similarity over sentence embeddings; see the retrieval sketch after this list), and then generate the correct invocation.
  • Plan+Retrieve+Call: The model receives a complex user request that entails planning and executing a sequence of API calls, requiring both retrieval and correct execution order.
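
The retrieval step at the Retrieve+Call level can be pictured as a standard dense-retrieval loop over API documentation. Below is a minimal sketch; the `embed` callable is a placeholder for any sentence-embedding model and is not API-Bank's reference retriever.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve_api(user_query: str, api_docs: dict[str, str], embed) -> str:
    """Return the API whose documentation embedding is most similar to the query.

    `embed` is a hypothetical callable mapping text -> np.ndarray; any
    sentence-embedding model could play this role.
    """
    query_vec = embed(user_query)
    scores = {name: cosine(query_vec, embed(doc)) for name, doc in api_docs.items()}
    return max(scores, key=scores.get)
```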

Inputs always include multi-turn dialogue context and, optionally, formal API documentation. The principal output is a structured API call string (name, parameters) compliant with the API definition and responsive to the user's intent. End-to-end correctness, including both planning and parameter extraction, is a central evaluation axis.
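
To make this input/output contract concrete, the snippet below shows one plausible shape for a Call-level instance; the field names, endpoint, and surface syntax are illustrative rather than API-Bank's exact annotation format.

```python
# Hypothetical Call-level instance: the dialogue history and the documented API
# are given, and the model must emit a structured call matching the signature.
dialogue_history = [
    {"role": "user", "content": "Please set an alarm for 7 am tomorrow."},
]

api_doc = {
    "name": "AddAlarm",                                   # illustrative endpoint
    "parameters": {"time": "str (YYYY-MM-DD HH:MM:SS)"},  # documented type/format
}

# Expected structured output: API name plus parameters compliant with the doc.
gold_call = {"api_name": "AddAlarm", "parameters": {"time": "2024-01-02 07:00:00"}}
```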

2. Dataset Construction and Annotation Pipeline

API-Bank incorporates two major data resources:

  • Evaluation System: 73 executable, real-world APIs, integrated in Python for stable and reproducible interaction, including endpoints for weather, scheduling, account management, CRUD, search, text-to-image, and more. For each API, endpoints are formalized with parameter type signatures (a minimal sketch of such a registration follows this list).
  • Dialogue Corpus:
    • Evaluation Set: 314 manually annotated multi-turn tool-use dialogues, containing 753 API calls. Domains are broad, spanning at least eight representative verticals (e.g., information retrieval, scheduling, healthcare).
    • Training Set: 1,888 synthetic agent-generated dialogues (produced via a five-stage LLM-powered pipeline) covering 2,138 distinct APIs and 1,000 domains, with 4,149 API call instances.
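
A minimal sketch of how such an executable, type-annotated endpoint might be registered in Python follows; the class and the example endpoint are hypothetical, not API-Bank's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class APIEndpoint:
    """Hypothetical wrapper pairing a formal parameter signature with an executable handler."""
    name: str
    description: str
    parameters: dict[str, type]   # parameter name -> expected Python type
    handler: Callable[..., Any]

    def call(self, **kwargs: Any) -> Any:
        # Enforce the documented type signature before executing the endpoint.
        for key, expected in self.parameters.items():
            if key not in kwargs or not isinstance(kwargs[key], expected):
                raise ValueError(f"Invalid or missing parameter: {key}")
        return self.handler(**kwargs)

# Illustrative registration of a weather endpoint.
weather = APIEndpoint(
    name="QueryWeather",
    description="Return the forecast for a city on a given date.",
    parameters={"city": str, "date": str},
    handler=lambda city, date: {"city": city, "date": date, "forecast": "sunny"},
)
```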

Annotation involves two annotators creating user queries tailored to APIs, specifying exact call signatures, and recording actual responses from the framework. Additional reviewers ensure structural and logical correctness. Approximately 78.5% of drafted dialogues pass quality control.

The synthetic data pipeline employs LLMs to propose domains, specify APIs, generate realistic user queries, execute simulated calls, and check outputs—yielding high-quality, low-cost tool-use data (Li et al., 2023).
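
A schematic of that five-stage generator is sketched below; `llm` stands in for whatever chat model drives each stage, and all prompts and function names are illustrative.

```python
def synthesize_dialogue(llm) -> dict:
    """Schematic five-stage synthetic-data pipeline; `llm` is any text-in/text-out callable."""
    domain = llm("Propose a realistic application domain for tool use.")         # 1. domain
    api_spec = llm(f"Specify an API (name, parameters, types) for: {domain}")    # 2. API spec
    query = llm(f"Write a realistic user request answerable with: {api_spec}")   # 3. user query
    trace = llm(f"Simulate calling {api_spec} to satisfy the request: {query}")  # 4. call + response
    verdict = llm("Check that this exchange is consistent and executable:\n"
                  f"{query}\n{trace}")                                           # 5. quality check
    return {"domain": domain, "api": api_spec, "query": query,
            "trace": trace, "check": verdict}
```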

3. Evaluation Metrics and Protocols

Metrics are tailored to capture fine-grained aspects of tool use and plan correctness:

  • API Call Accuracy:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\hat{c}_i = c_i]$$

where $N$ is the total number of calls, $\hat{c}_i$ is the predicted call, and $c_i$ is the gold call.

  • API Retrieval Precision/Recall:

$$\text{Precision} = \frac{|R_p \cap R_g|}{|R_p|}, \qquad \text{Recall} = \frac{|R_p \cap R_g|}{|R_g|}$$

for predicted ($R_p$) and ground-truth ($R_g$) retrievals.

  • Planning Accuracy:

Fraction of dialogues for which the entire planned call sequence (ordering, arguments) is exactly correct.

  • Response Quality: ROUGE-L between generated and gold language responses.
  • Task Success Rate: Fraction of dialogues in which every API call is correct and the language response exceeds a reference ROUGE-L threshold.

These metrics enable evaluation across ability levels: from single-call scenarios, through retrieval-based invocations, to multi-step planning (Li et al., 2023).
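
As a concrete reading of these definitions, here is a minimal sketch of the call-accuracy, retrieval precision/recall, and planning-accuracy computations; the helper names are ours, not the benchmark's scoring code.

```python
def call_accuracy(pred_calls, gold_calls):
    """Fraction of predicted API calls that exactly match their gold counterparts."""
    assert len(pred_calls) == len(gold_calls)
    return sum(p == g for p, g in zip(pred_calls, gold_calls)) / len(gold_calls)

def retrieval_precision_recall(retrieved, gold):
    """Set-based precision and recall over retrieved vs. ground-truth APIs."""
    hits = len(set(retrieved) & set(gold))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def planning_accuracy(pred_sequences, gold_sequences):
    """Fraction of dialogues whose full call sequence (order and arguments) is exactly correct."""
    return sum(p == g for p, g in zip(pred_sequences, gold_sequences)) / len(gold_sequences)
```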

4. Systems, Baselines, and Empirical Results

Baseline and advanced models are evaluated on the API-Bank suite, notably:

  • Zero-shot LLMs: Alpaca-7B, ChatGLM-6B, GPT-3 (Davinci), GPT-3.5-turbo, GPT-4.
  • Fine-tuned LLMs: Lynx (Alpaca-7B fine-tuned on API-Bank).

Key results (as summarized in Table 1 of (Li et al., 2023)):

| Model | Call (%) | Retrieve+Call (%) | Plan+Retrieve+Call (%) | Overall Correctness (%) | ROUGE-L |
|---|---|---|---|---|---|
| Alpaca-7B | 24 | 0 | 0 | - | 0.032 |
| GPT-3.5-turbo | 59.4 | 38.5 | 22.0 | 47.2 | 0.427 |
| GPT-4 | 63.7 | 37.0 | 70.0 | 60.2 | - |
| Lynx-7B (FT) | 49.9 | 30.4 | 20.0 | 39.6 | 0.379 |

Lynx fine-tuning lifts Alpaca’s Call accuracy by roughly 26 points. GPT-4 achieves the highest planning accuracy on the Plan+Retrieve+Call setting, while most models show a marked drop in performance as they move from Call to the more complex scenarios.

A separate evaluation of API-Bank data embedded within API-BLEND confirms its difficulty and utility as a challenging out-of-distribution testbed: models fine-tuned on diverse, multi-domain in-distribution data consistently outperform purely synthetic or narrow-domain baselines (Basu et al., 2024).

5. Error Analysis and Open Challenges

Empirical error analysis indicates six major error categories (Li et al., 2023):

  1. No API Call (missed invocation)
  2. API Hallucination (nonexistent/wrong endpoint)
  3. Invalid Input Parameters
  4. False API Call Format (unparsable signature)
  5. Missing Input Parameters
  6. Runtime Exceptions

Alpaca-7B frequently omits API calls or formats them incorrectly. Lynx reduces format errors, but API hallucination remains its dominant error category (61.4%); GPT-4’s primary failure is incorrect retrieval or omission of the necessary “API Search” step.
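
A rough sketch of how a single model turn might be bucketed into these categories is given below; this is a hypothetical helper, and API-Bank's own evaluator may differ in details.

```python
def categorize_turn(raw_output, parsed_call, known_apis, schemas, runtime_error=None):
    """Assign a model turn to one of the six API-Bank error categories, or 'Correct'.

    `parsed_call` is a (name, params) pair if the output could be parsed, else None;
    `schemas` maps API name -> {param: {"type": type, "required": bool}}.
    """
    if parsed_call is None and "(" not in raw_output:
        return "No API Call"                      # 1. missed invocation
    if parsed_call is None:
        return "False API Call Format"            # 4. unparsable signature
    name, params = parsed_call
    if name not in known_apis:
        return "API Hallucination"                # 2. nonexistent/wrong endpoint
    schema = schemas[name]
    if any(spec["required"] and p not in params for p, spec in schema.items()):
        return "Missing Input Parameters"         # 5. required argument absent
    if any(p not in schema or not isinstance(v, schema[p]["type"]) for p, v in params.items()):
        return "Invalid Input Parameters"         # 3. wrong name or type
    if runtime_error is not None:
        return "Exception"                        # 6. runtime exception
    return "Correct"
```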

Open challenges identified by API-Bank include:

  • Reliable API retrieval and dynamic discovery to mitigate planning/retrieval failures.
  • Decoding constraints to ensure compliance with parameter schemas and permissible values.
  • Scaling data diversity and exploring training objectives targeting planning/parameter extraction accuracy.

FANTASE extends API-Bank-oriented evaluation by introducing state-tracked constrained decoding (SCD) with token trie masking, enabling strict faithfulness to API documentation at decode time and achieving benchmark-leading accuracy (up to 64.4% on API-Bank, surpassing GPT-4), with 1.8×–2.3× inference speedup and improved context efficiency (Wang et al., 2024).
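
The core idea behind trie-constrained decoding can be sketched generically as follows; this illustrates token-trie masking in the abstract and is not FANTASE's actual implementation.

```python
import math

END = None  # sentinel marking the end of a complete, documented sequence

def build_token_trie(valid_token_sequences):
    """Build a trie over the token-id sequences of documented API names/signatures."""
    trie = {}
    for seq in valid_token_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node[END] = {}
    return trie

def constrain_logits(logits, trie_node):
    """Mask next-token logits so that only continuations of a valid trie path survive."""
    allowed = {tok for tok in trie_node if tok is not END}
    return [score if i in allowed else -math.inf for i, score in enumerate(logits)]

# At each decoding step the decoder samples from the masked distribution and then
# descends into trie_node[token]; reaching END means a complete documented API
# name has been emitted and unconstrained decoding (e.g. of argument values) resumes.
```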

6. Related Benchmarks and Comparative Positioning

API-Bank stands out by offering an executable, human-curated, multi-domain evaluation suite. Nevertheless, several related benchmarks complement or challenge its design:

  • API-BLEND: Integrates API-Bank examples within a large multi-domain suite designed to unify tool detection, slot filling, and sequencing under F₁-based metrics (API-F₁, Parameter-F₁, LCS-F₁), providing a broader perspective on model generalization (Basu et al., 2024); a sketch of an LCS-style F₁ follows this list.
  • ShortcutsBench: Emphasizes real-world, multi-step planning and parameter fidelity (including “Ask” actions and context recognition) on Apple Shortcuts, with substantially higher complexity (an average of 21.46 actions per query vs. API-Bank’s ≈2.2). Its design explicitly critiques simpler benchmarks, API-Bank included, for lacking parameter-level evaluation, input awareness, and multi-dimensional reasoning (Shen et al., 2024).
  • APIBench: Focuses on query- and code-based API recommendation, distinct from dialogue-grounded tool use; APIBench highlights batch recommendation and hybrid code completion (Peng et al., 2021).
  • Benchmarking Web API Integration Code Generation: Targets code-level correctness for RESTful API invocation, highlighting persistent errors (hallucinated endpoints, illegal arguments) in LLM-generated integration code, complementary to API-Bank’s dialogue orientation (Maninger et al., 2025).
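
For the sequencing component, one plausible ROUGE-L-style reading of an LCS-based F₁ over predicted vs. gold API sequences is sketched below; this is an assumption-driven illustration, not API-BLEND's reference scorer.

```python
def lcs_length(pred, gold):
    """Length of the longest common subsequence between two API-name sequences."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_f1(pred, gold):
    """Harmonic mean of LCS-based precision and recall over API sequences."""
    if not pred or not gold:
        return 0.0
    lcs = lcs_length(pred, gold)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(gold)
    return 2 * precision * recall / (precision + recall)
```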

A plausible implication is that, while API-Bank was foundational in defining structured tool-use evaluation for LLMs, the field is rapidly diversifying. Newer benchmarks extend the assessment to longer planning horizons, richer parameter types, and execution-level code validation.

7. Practical Impact and Future Directions

API-Bank catalyzed a wave of research into tool-augmented LLMs, supplying both rigorous, executable evaluation and a large, high-quality synthetic training corpus. It led to fine-tuned LLMs (e.g., Lynx), decoding strategies (constrained trie-based decoding), and has become an out-of-distribution testbed in newer meta-benchmarks such as API-BLEND (Basu et al., 2024, Wang et al., 2024).

Recommended future work includes:

  • Enhanced retrieval-augmented LLM architectures for improved API search and dynamic service discovery.
  • Decoding constraints that guarantee API-call and parameter compliance—either at inference (e.g., with SCD) or via auxiliary training objectives.
  • Composition of multi-hop reasoning and clarification sub-agents to handle error cases involving missing information or ambiguous queries.
  • Extension to cross-platform, distributed, and code-execution-centric settings to stress interoperability and execution fidelity.

By systematically diagnosing both successes and persistent failure modes, API-Bank remains a cornerstone for tool-use benchmarking and model development in the LLM ecosystem (Li et al., 2023, Wang et al., 2024).
