
API-Bank: LLM Benchmark & Financial APIs

Updated 26 December 2025
  • API-Bank is a multi-domain concept that spans LLM tool-use evaluation benchmarks and financial API transformation in open banking ecosystems.
  • The LLM benchmark evaluates API call generation using structured decoding, synthetic dialogues, and precise metrics like exact-match accuracy.
  • The financial API aspect drives open banking transformation by enhancing security protocols, compliance, and digital innovation in traditional banks.

API-Bank is a term with multiple meanings across open banking ecosystems, LLM tool-use evaluation, and secure financial API design. In the context of LLMs, API-Bank is a comprehensive benchmark and dataset for assessing and training tool-augmented LLMs. In financial systems, it denotes the architectural transformation of traditional banks into platform-centric organizations, built around open and standardized APIs for account, payment, and data services. Across both domains, API-Bank captures the intersection of robust technical infrastructure, rigorous security modeling, and practical evaluation at scale.

1. Benchmark for Tool-Augmented LLMs

API-Bank is the first end-to-end benchmark designed specifically to quantify, train, and analyze LLMs’ abilities to plan, retrieve, and invoke external APIs (Li et al., 2023). Its design addresses three pivotal research questions: how proficient current LLMs are at tool use, how that proficiency can be improved, and which obstacles persist.

The benchmark decomposes tool-use ability into three axes:

  • Call: Given an API pool and user query, can the model select and parameterize the correct call?
  • Retrieve+Call: For a large candidate pool, can the model identify and then invoke the appropriate API?
  • Plan+Retrieve+Call: Can the model autonomously orchestrate multi-step sequences of search and function invocations in response to complex user needs?

Benchmark Dataset and Evaluation System

API-Bank comprises:

  • Evaluation Suite: 73 real-world, executable APIs (including weather, search, CRUD, and image generation) with a “frozen” backend, enabling reproducible, deterministic responses.
  • Annotated Dialogues: 314 human-curated dialogues containing 753 API calls, systematically labeled for granularity and logical consistency.
  • Synthetic Training Corpus: 1,888 dialogues with 2,138 unique APIs, spanning 1,000 domains, generated via a multi-agent LLM pipeline and passing strict quality controls.
  • Evaluation Metrics: Exact-match accuracy for API calls and ROUGE-L for response quality. Retrieval precision/recall and end-to-end success rate are reported for retrieval and planning subtasks.

The benchmark supports zero-shot, few-shot, and instruction-tuned evaluations. LLMs interact with input contexts (dialogue histories and API docs/specs) and produce API calls executable by an API-Bank backend.
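The exact-match metric above can be sketched as a simple string comparison. This is an illustrative assumption about the implementation; the benchmark's own scorer may normalize whitespace or argument order differently.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted API calls identical to the gold calls."""
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Example: one of two predicted calls matches the gold call exactly.
score = exact_match_accuracy(["f(x=1)", "g()"], ["f(x=1)", "h()"])  # 0.5
```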

2. Model Architectures and the FANTASE Approach

FANTASE (“Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking”) represents a methodological advance in LLM tool-use, introducing state-tracked constrained decoding (SCD) and reranking (Wang et al., 2024).

State-Tracked Constrained Decoding (SCD)

SCD strictly enforces constraints derived from API documentation during generation:

  • Extraction and Trie Construction: For each API, packages, functions, argument names (required/optional), and allowed values are parsed and mapped into a token-level constrained trie indexed by subwords.
  • Finite-State Machine (FSM) State Tracking: The decoder tracks structural position—package, function, arguments, etc.—and updates the allowed next-token set accordingly, ensuring only permissible continuations at every step.
  • Masked Decoding: Given LLM output logits, a masked distribution is formed over allowed vocabulary elements, with sampling or beam search restricted to valid continuations. The algorithm ensures exact adherence to API signatures without LLM fine-tuning.
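The trie and masking machinery above can be illustrated with a minimal sketch. Here tokens are single characters and the tracked "state" is simply the current trie node; FANTASE's actual decoder operates over subword tokens and a richer FSM across package/function/argument positions. The names `build_trie` and `constrained_greedy_decode` are hypothetical, introduced only for this illustration.

```python
def build_trie(signatures):
    """Build a token-level trie from complete, prefix-free API signatures."""
    root = {}
    for sig in signatures:
        node = root
        for tok in sig:                 # one trie edge per token (here: per char)
            node = node.setdefault(tok, {})
        node["<end>"] = {}              # marks a complete signature
    return root

def constrained_greedy_decode(logits_fn, trie):
    """Greedily pick the highest-scoring token among trie-allowed continuations.

    logits_fn maps the decoded prefix to a dict of per-token scores, standing
    in for the LLM's output logits; tokens absent from the dict score 0.0.
    """
    node, output = trie, []
    while not ("<end>" in node and len(node) == 1):   # stop at a completed call
        allowed = [t for t in node if t != "<end>"]   # mask: valid next tokens
        scores = logits_fn("".join(output))
        tok = max(allowed, key=lambda t: scores.get(t, 0.0))
        output.append(tok)
        node = node[tok]
    return "".join(output)

# The model "prefers" bar over foo at the first branch point; every later
# step has a single valid continuation, so the mask forces a legal call.
trie = build_trie(["foo(x)", "bar(y)"])
call = constrained_greedy_decode(lambda prefix: {"b": 1.0}, trie)  # "bar(y)"
```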

Reranking Module

After SCD beam search generates B candidates, a lightweight RoBERTa-base discriminator reranks them:

  • Input: [CLS]〈dialogue + API-docs〉[SEP]〈candidate API call〉[SEP]
  • Scoring: Candidates are scored against the gold API call by means of a hybrid loss (mean squared error and Spearman rank).
  • Selection: The candidate with the highest reranker score is output, which empirically yields higher exact-match accuracy than SCD beam search alone.
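The encoding and selection steps can be sketched as below; `scorer` stands in for the RoBERTa-base discriminator's regression head, which is not reproduced here.

```python
def rerank(dialogue_and_docs, candidates, scorer):
    """Return the beam candidate with the highest discriminator score.

    Each candidate is packed into the [CLS]context[SEP]candidate[SEP] format
    described above before scoring.
    """
    def encode(candidate):
        return f"[CLS]{dialogue_and_docs}[SEP]{candidate}[SEP]"
    return max(candidates, key=lambda c: scorer(encode(c)))

# A toy scorer picks the candidate that fills the required argument.
best = rerank(
    "User asked for Paris weather",
    ["GetWeather(city='Paris')", "GetWeather()"],
    lambda s: 1.0 if "Paris" in s.split("[SEP]")[1] else 0.0,
)
```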

Quantitative Performance on API-Bank

Model/Setting                       Exact-Match Accuracy (%)
GPT-4                               63.66
GPT-3.5-turbo                       59.40
Alpaca-13B (greedy)                 24.06
Alpaca-13B + SCD (greedy)           56.64
Alpaca-13B + SCD (beam)             62.66
Alpaca-13B + SCD (beam) + reranker  64.41
Lynx-7B (fine-tuned)                50.53

SCD delivers near-GPT-4 accuracy at roughly a quarter of the inference time of regular beam search, with the reranker yielding a further gain of approximately 2 percentage points (Wang et al., 2024).

3. API-Bank as Financial Platform: Open Banking Transformation

“API-Bank” also refers to the strategic, architectural, and organizational reconfiguration of incumbent banks around open APIs (Stefanelli et al., 2022). Post-PSD2, European financial institutions are increasingly adopting API-first platforms, exposing products and services to external parties (FinTechs, consumers) via RESTful interfaces.

Strategic Framework

Banks’ maturity and innovation are modeled on two axes:

  • Internal digitalization (API portfolio): Number and variety of internally developed/published APIs (“inside-out”).
  • External partnerships: Number of equity investments or commercial agreements with FinTechs (“outside-in”).

This produces a typology:

  • Invisible banks: Large, diversified API stacks and many external partnerships.
  • Technological hubs: Rich APIs, moderate partnerships.
  • Innovative banks: High partnerships, modest API portfolios.
  • Follower banks: Minimum regulatory compliance on both axes.

API portfolios range from a PSD2-mandated floor of 3 to approximately 20, with value-added APIs (e.g., credit assessment, personalized PFM) marking technological leaders.

Organizational and Regulatory Drivers

  • Branch transformation (e.g., increased ATM/POS focus)
  • Component-based architectures for micro-service integration
  • Chief Digital/Chief API Officer appointments
  • Sandbox environments for third-party code/testing
  • ECB/EBA guidelines on system resilience, cross-border operability, and digital inclusion

A major consequence is the shift of customer interaction from physical to digital channels, with APIs as the principal front door to the institution.

4. Formal Security Modeling and Verification of API-Bank Protocols

Security is paramount in banking APIs, and formal analyses are used to verify both protocol correctness and resilience against various attacker models (Modesti et al., 2020, Fett et al., 2019).

Formal Methods for Open Banking Protocols

  • Modeling Languages: Extended Alice & Bob notation (AnBx), Web Infrastructure Model (WIM) in Dolev-Yao style.
  • Authentication, Authorization, Integrity Goals: Encoded as secrets, agent bindings, freshness (nonces), and strongly typed messages.
  • Verification Tools: OFMC (Open-Source Fixed-Point Model-Checker) and ProVerif.
  • Protocol Flows: Typically, OAuth2 + mTLS for token, consent, and data exchange among PSU (end-user), AISP (third party), ASPSP (bank), and resource servers.
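The first leg of such a flow, the AISP redirecting the PSU to the ASPSP's authorization endpoint with a fresh state value, might be sketched as below. All endpoint and client identifiers are illustrative placeholders, not real open banking URLs.

```python
import secrets
from urllib.parse import urlencode

def build_authorization_request(authorize_endpoint, client_id, redirect_uri, scope):
    """Construct the OAuth2 authorization-code redirect URL.

    A fresh, unguessable state value is generated per request, providing the
    CSRF / replay defence that the formal analyses above depend on.
    """
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,
    }
    return f"{authorize_endpoint}?{urlencode(params)}", state

# Placeholder values; the returned state must be checked on the callback.
url, state = build_authorization_request(
    "https://bank.example/authorize", "aisp-client",
    "https://tpp.example/cb", "accounts",
)
```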

Key Properties and Mitigations

Eight core security goals cover secrecy, authentication, and integrity of client/user secrets, consent, and resource access (see Section 3, (Modesti et al., 2020)). Verification proceeds through:

  • Automated proof of these goals under unlimited sessions and strong typing.
  • Discovery and rectification of modeling flaws (replay attacks mitigated via fresh nonces).
  • Practical testing on real (e.g., NatWest) open banking sandboxes, confirming mTLS, OAuth2 grant security, and replay protection.

FAPI (Financial-grade API) Analysis

  • High-level Flows: OAuth2 Authorization Code + PKCE, JWS assertion, mTLS/Token Binding.
  • Attack Classes: Cuckoo’s Token (token substitution), endpoint misconfiguration (token replay), PKCE challenge attacks, CSRF-style session swapping.
  • Fixes: Inclusion of token issuer, at_hash binding, request JWTs, and token binding for web flows. Proofs show that these strengthen authorization, authentication, and session integrity (Fett et al., 2019).
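The at_hash binding mentioned above follows OpenID Connect Core: for an RS256/PS256-signed ID token, the claim is the base64url-encoded (unpadded) left half of the SHA-256 digest of the access token, letting the client detect token substitution. A minimal sketch of the computation:

```python
import base64
import hashlib

def compute_at_hash(access_token: str) -> str:
    """at_hash per OpenID Connect Core for SHA-256-based signing algorithms:
    base64url-encode the left-most 128 bits of SHA-256(access_token),
    without padding."""
    digest = hashlib.sha256(access_token.encode("ascii")).digest()
    half = digest[: len(digest) // 2]
    return base64.urlsafe_b64encode(half).rstrip(b"=").decode("ascii")

# The client recomputes at_hash from the received access token and compares
# it to the claim inside the signed ID token; a mismatch signals substitution.
```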

5. Empirical Results and Error Analysis

The API-Bank benchmark yields deep insights into LLM tool-use error modes and highlights practical challenges in financial API design.

LLM Tool-Use Failures

Frequent error categories on API-Bank benchmark (Li et al., 2023, Wang et al., 2024):

  • Omitted API call
  • Hallucinated API
  • Missing/wrong input parameters
  • Malformed syntax
  • Unhandled exceptions
  • Missing argument values

SCD in FANTASE eliminates almost all missing/invalid parameter errors. Reranking ensures that the correct sequence, even if not top-ranked by sequence probability, can be recovered from the beam. Residual errors are concentrated in argument-value pairings dependent on fine-grained world knowledge or in cases of ambiguous documentation.

Financial API Practical Security

Penetration tests and practical sandboxing (e.g., replay without nonce, malformed/JWT signature errors) validate that formal models have operational significance. For real deployments:

  • All requests must be bound to client secrets/tokens
  • Freshness via nonces is essential for replay resistance
  • Whitelisting of redirect URIs and strict JSON schema adherence are non-negotiable
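The freshness requirement above can be sketched as a nonce registry that rejects any value already seen within its validity window. This is a minimal in-memory illustration; real deployments would back it with a shared, persistent store.

```python
import time

class NonceRegistry:
    """Reject replayed nonces within a freshness window (in-memory sketch)."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.seen = {}  # nonce -> first-seen timestamp

    def accept(self, nonce, now=None):
        """Return True for a fresh nonce, False for a replay."""
        now = time.time() if now is None else now
        # Evict nonces older than the window; they can no longer be replayed
        # against us because the accompanying request would fail freshness.
        self.seen = {n: t for n, t in self.seen.items() if now - t < self.window}
        if nonce in self.seen:
            return False
        self.seen[nonce] = now
        return True
```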

6. Implications, Extensions, and Open Challenges

LLM-Centric Tool Use

FANTASE demonstrates that constrained decoding coupled with lightweight supervised reranking enables API-call generation with near SOTA accuracy, requiring no LLM fine-tuning and offering significant inference speedups (Wang et al., 2024). The approach generalizes to other structured prediction domains (e.g., Text-to-SQL, JSON), provided structural FSM and trie constraints are well-defined.

A challenge remains in accommodating APIs with large or unbounded parameter/value spaces, hybrid or runtime-derived constraints, and APIs at web scale (>10,000 endpoints), where existing Trie/FSM approaches may need hierarchical or vector search augmentation.

API-Bank in Finance

Most European banks are converging toward the “invisible bank” model, wherein dense internal API portfolios and extensive external partnerships become the basis for digital value creation (Stefanelli et al., 2022). Risks include over-outsourcing (loss of core competence/data governance), digital exclusion (asymmetric access), and cyber-physical resilience (dependency on FinTech providers).

A plausible implication is that API-Bank architectures will increasingly require rigorous, ongoing formal verification not only at the protocol level but across the CI/CD workflow, especially as banks migrate to component-based and composable financial services.

References

  • "API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs" (Li et al., 2023)
  • "FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking" (Wang et al., 2024)
  • "Security Analysis of the Open Banking Account and Transaction API Protocol" (Modesti et al., 2020)
  • "An Extensive Formal Security Analysis of the OpenID Financial-grade API" (Fett et al., 2019)
  • "Digital financial services and open banking innovation: are banks becoming invisible?" (Stefanelli et al., 2022)
