
FANTASE: Faithful API Call Generation

Updated 6 February 2026
  • FANTASE Approach is a framework integrating state-tracked constrained decoding and lightweight reranking to enforce strict adherence to API specifications.
  • It improves accuracy and efficiency, achieving up to 2.4× faster inference and notable gains in exact-match API call accuracy.
  • The method is adaptable to evolving APIs, working with any autoregressive LLM without modifying model weights or relying on extensive prompt engineering.

FANTASE (FAN-TAstic SEquences and Where to Find Them) refers to a framework for faithful and efficient API call generation that augments autoregressive LLMs with output-side constrained decoding and lightweight reranking mechanisms. The approach directly addresses critical shortcomings of supervised and in-context learning methods in generating API calls, specifically the lack of faithfulness to API specifications and suboptimal use of compute and data resources. FANTASE is compatible with any autoregressive LLM and operates without modifying model weights or substantially increasing prompt length, thus ensuring adaptability to evolving APIs and deployment in resource-constrained environments (Wang et al., 2024).

1. Motivation and Problem Formulation

The rising adoption of LLMs for tool use hinges on their ability to generate valid and faithful API calls, but standard approaches face crucial difficulties:

  • Specification Faithfulness: Models often emit API calls violating type, parameter, or value constraints, even if calls are syntactically plausible, resulting in frequent non-executable outputs. In Table 1 of (Wang et al., 2024), top-1 exact-match accuracy using standard beam search reaches only ~41% even when the gold call exists in the beam.
  • Data and Compute Efficiency: Supervised fine-tuning methods require large labeled corpora and are costly to maintain as APIs evolve. In-context learning with long prompts and exemplars does not guarantee faithful outputs or scalable deployment.

FANTASE is designed as an output-side optimization layer atop fixed LLMs to provide:

  • API specification faithfulness via per-token constrained search.
  • Enhanced accuracy and context/data efficiency.
  • Rapid inference through sub-vocabulary pruning.

2. State-Tracked Constrained Decoding (SCD)

FANTASE’s central mechanism is State-Tracked Constrained Decoding, which introduces an explicit token-level validity filter for the decoding process.

Token Search Trie Construction

API documentation, including function names, parameter names, and legal values, is preprocessed into a Constrained Token Search Trie (CTST). Each path from the root to a leaf node encodes a complete valid API call, tokenized to match the model’s vocabulary.

Formally, at generation step $t$ with prefix $x_1 \ldots x_{t-1}$ and vocabulary $V$, the set of valid next tokens is:

$$C_t = \left\{ v \in V \mid \mathrm{Trie.has\_prefix}(x_1 \ldots x_{t-1} v) \right\}$$

This enables explicit enforcement of API documentation constraints at every generation step.
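The trie lookup above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the class and method names (`TokenTrie`, `valid_next_tokens`) and the integer token IDs are assumptions for the example.

```python
# Minimal sketch of a Constrained Token Search Trie (CTST) over
# pre-tokenized API calls (integer token IDs); names are illustrative.

class TokenTrie:
    def __init__(self):
        self.children = {}   # token id -> child TokenTrie
        self.is_end = False  # marks a complete valid API call

    def insert(self, token_ids):
        node = self
        for tok in token_ids:
            node = node.children.setdefault(tok, TokenTrie())
        node.is_end = True

    def valid_next_tokens(self, prefix):
        """Return C_t: tokens that keep the prefix on a valid path."""
        node = self
        for tok in prefix:
            node = node.children.get(tok)
            if node is None:
                return set()
        return set(node.children)

trie = TokenTrie()
trie.insert([7, 3, 9])  # e.g. tokenization of one valid call
trie.insert([7, 3, 5])  # a second call sharing the same prefix

assert trie.valid_next_tokens([7, 3]) == {9, 5}
assert trie.valid_next_tokens([8]) == set()
```

Each path from the root encodes one valid call, so $C_t$ is simply the child set of the trie node reached by the current prefix.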

SCD Decoding Algorithm

The SCD process operates as follows (see the pseudocode in (Wang et al., 2024)):

  • Beam or Greedy Search: For each hypothesis in the beam, query the current trie node to determine $C_t$, mask the LLM logits for invalid tokens ($\forall v \notin C_t,\ \mathrm{logits}[v] \leftarrow -\infty$), and expand only valid continuations.
  • Faithfulness Guarantee: By construction, every output sequence remains compliant with the API specification.
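A single masking step can be sketched as follows, assuming greedy search over a toy six-token vocabulary; the function name `scd_step` and the numeric logits are illustrative, not taken from the paper.

```python
# One SCD decoding step: mask logits outside C_t to -inf, then
# select the highest-scoring valid token (greedy case).
import math

def scd_step(logits, valid_tokens):
    masked = [x if i in valid_tokens else -math.inf
              for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

logits = [2.0, 5.0, 1.0, 0.5, 3.0, -1.0]  # LLM scores over the vocabulary
C_t = {0, 3, 4}                            # valid continuations from the trie
assert scd_step(logits, C_t) == 4          # token 1 scores highest but is invalid
```

Because invalid tokens receive $-\infty$, no hypothesis can ever leave the trie, which is exactly the faithfulness guarantee stated above.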

Computational Efficiency

Time complexity per decoding step is $O(|C_t|)$ (the number of valid continuations) plus trie lookup cost, both typically much smaller than $O(|V|)$ (the full vocabulary size). Empirically, SCD yields a $1.5\times$–$2.4\times$ speedup in inference over unconstrained decoding, with beam size and constraint tightness influencing the gain (see Table 4 in (Wang et al., 2024)).

3. Lightweight Reranking

While SCD ensures faithfulness, it does not guarantee the assignment of highest probability to the correct call. To resolve ranking failures, FANTASE introduces a lightweight reranker.

Discriminator Architecture and Training

  • Model: A RoBERTa-base discriminator ($125$M parameters) $D_\theta$ takes as input: $[\mathrm{CLS}]$ context + API-doc summary + candidate API call $[\mathrm{SEP}]$.
  • Training: Beam search with SCD provides $N$ candidates per example, each labeled with a ground-truth match $y_i \in \{0,1\}$.
  • Objective: Combines mean-squared error (MSE) loss and Spearman soft-rank correlation loss for calibrated ranking:

$$L_\mathrm{MSE} = \sum_i \left( R(c_i) - y_i \right)^2$$
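The MSE term of this objective is straightforward to compute; the sketch below omits the Spearman soft-rank term and uses made-up candidate scores and labels purely for illustration.

```python
# MSE part of the reranker training objective (Spearman soft-rank
# correlation term omitted); values are illustrative.
def mse_loss(scores, labels):
    return sum((r - y) ** 2 for r, y in zip(scores, labels))

scores = [0.8, 0.3, 0.6]  # discriminator outputs R(c_i) for one beam
labels = [1, 0, 0]        # y_i: exact match with the gold call
assert abs(mse_loss(scores, labels) - 0.49) < 1e-9
```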

Candidate Scoring and Final Reranking

Each candidate $c$ is scored as

$$S(c) = \log P_\mathrm{LM}(c \mid \mathrm{ctx}) + \alpha\, R(c)$$

where $P_\mathrm{LM}$ is the SCD log-probability, $R(c)$ is the discriminator output normalized to zero mean and unit variance within the beam, and $\alpha$ controls the reranker's influence.
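The combined scoring can be sketched as below; the function name `rerank` and all numeric values are assumptions for the example, and the within-beam normalization follows the zero-mean, unit-variance description above.

```python
# Sketch of final scoring S(c) = log P_LM(c|ctx) + alpha * R(c),
# with discriminator scores normalized within the beam; illustrative only.
import statistics

def rerank(log_probs, disc_scores, alpha=1.0):
    mu = statistics.mean(disc_scores)
    sigma = statistics.pstdev(disc_scores) or 1.0  # guard against zero variance
    normed = [(r - mu) / sigma for r in disc_scores]
    scores = [lp + alpha * r for lp, r in zip(log_probs, normed)]
    return max(range(len(scores)), key=lambda i: scores[i])

log_probs = [-1.2, -1.5, -3.0]  # SCD log-probabilities per candidate
disc = [0.1, 0.9, 0.2]          # raw discriminator outputs R(c)
assert rerank(log_probs, disc, alpha=1.0) == 1  # reranker promotes candidate 1
```

With $\alpha = 0$ the ranking reduces to the base SCD log-probabilities, so $\alpha$ directly interpolates between the LLM's ordering and the discriminator's.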

This procedure efficiently injects supervised signal at inference with negligible computational overhead compared to fine-tuning.

4. FANTASE Inference Pipeline

The full inference sequence proceeds as:

  1. Prompt Assembly: [System instructions + possible few-shot exemplars + user utterance].
  2. SCD Decoding: At each generation step, update $C_t$ from the CTST, mask logits, and produce a beam of faithfulness-guaranteed API call candidates.
  3. Reranker Evaluation: For each candidate, compute $R(c)$ with the discriminator.
  4. Scoring and Selection: Combine base SCD log-probability and reranker output; select the top-scoring API call for output.

This output-side methodology differs from classical fine-tuning or prompt engineering: it does not require changing LLM weights or bloating the context window.

5. Empirical Results

FANTASE was extensively evaluated on two API-call generation benchmarks:

| Dataset | Setting | Baseline Beam | SCD Beam | SCD + Rerank |
|---|---|---|---|---|
| DSTC8 | Few-shot (~2-shot) | 40.49% | 44.17% | 48.88% (≈ GPT-3.5) |
| API Bank | Zero-shot | 24.31% | 62.66% | 64.41% (≫ GPT-3.5) |
  • Inference Speed: SCD greedy decoding yields $1.56\times$ faster inference (3.42 s/sample vs 5.32 s), and SCD beam search reaches a $2.39\times$ speedup (6.33 s vs 15.12 s).
  • Context Efficiency: SCD retains higher accuracy after API documentation is removed from prompts ($-1.63\%$ vs $-2.25\%$ on DSTC8; retains $\approx 23\%$ zero-shot accuracy on API Bank).
  • Faithfulness: By design, all SCD outputs conform exactly to the API documentation; reranking promotes correct candidates that the base model initially ranked below other beam hypotheses.

6. Practical Considerations and Applicability

FANTASE offers several practical advantages:

  • Adaptability to API Changes: No LLM retraining is required to accommodate updated API specifications—updating the trie suffices.
  • Data Scarcity: The output-side orientation enables high accuracy in zero-shot or low-shot domains where labeled fine-tuning data is scarce.
  • Resource Efficiency: Speedups and compact prompts suit latency-sensitive or context-limited deployments (e.g., edge inference).
  • Decoupling from LLM Knowledge: Faithfulness is enforced independently of the LLM’s inherent internalization of API semantics, preventing hallucinated or outdated API calls (Wang et al., 2024).

A plausible implication is that FANTASE enables reliable, faithful integration of LLMs with evolving tool APIs in practical, real-world settings without the data, cost, or maintenance burden of retraining-based approaches.

7. Limitations and Prospects

FANTASE’s effectiveness depends on the specification expressiveness and tractability of constructing the CTST for the task’s API set. Its reranking capacity is limited by the discriminative power and domain coverage of the lightweight RoBERTa-based discriminator. Scenarios with extremely large or highly ambiguous API spaces may necessitate further trie or beam management optimization. Nevertheless, the paradigm exemplifies a shift toward output-side, constraint-aware postprocessing as an efficient bridge between powerful LLMs and reliability-critical tool-use applications (Wang et al., 2024).
