State-Tracked Constrained Decoding (SCD)

Updated 6 February 2026

State-Tracked Constrained Decoding is a method that uses a prefix trie to enforce structural and semantic constraints during token generation.
It improves API call generation by masking invalid tokens, reducing computational overhead and ensuring compliance without prompt-side modifications.
Integrated in frameworks like FANTASE, SCD enhances efficiency and accuracy through output-side optimization and dynamic Trie-based token filtering.

State-Tracked Constrained Decoding (SCD) is a decoding technique for LLMs that rigorously enforces structural and semantic constraints derived from external specifications—such as API documentation—during output generation. SCD achieves this by tracking the state of decoding with respect to a prefix trie encoding all valid completions, thereby guaranteeing that every token output maintains validity with respect to the constraints at all times. The mechanism is central to architectures such as FANTASE, in which SCD forms the backbone for faithful, efficient, and doc-compliant API call generation, functioning entirely as an output-side optimization without necessitating any in-context prompt engineering or model fine-tuning (Wang et al., 2024).

1. Motivation and Context

Faithful API call generation is vital for LLM-based agents tasked with interfacing reliably with external tools. Traditional approaches generally fall into two categories: supervised fine-tuning and in-context learning. Supervised fine-tuning incurs significant resource costs for retraining whenever APIs evolve and often fails to guarantee conformance with updated API specifications. In-context learning—whether few-shot or zero-shot—offers greater adaptability but often fails to strictly enforce API syntax, argument names, types, and documentation, leading to unfaithful or incomplete calls. Lengthy prompts embedding API documentation further degrade computational efficiency and increase context window requirements (Wang et al., 2024).

State-Tracked Constrained Decoding addresses these challenges by shifting API-compliance enforcement from the prompt side (input) to the output side, so that only completions valid under the up-to-date API specification are permitted at every generation step.

2. Formal Problem Definition

Let $V$ denote the LLM’s full output token vocabulary, and let $x_1, \ldots, x_{t-1}$ be the tokens decoded so far for some output sequence at step $t$. Standard autoregressive decoding computes

$x_t = \underset{v \in V}{\arg\max}\ \log P_{\mathrm{LLM}}(v\mid x_1\cdots x_{t-1}, \text{Context}).$

SCD modifies this procedure by introducing a decoding-legal set $C_t \subseteq V$, determined by the decoder’s current state as follows:

$x_t = \underset{v \in C_t}{\arg\max}\ \log P_{\mathrm{LLM}}(v\mid x_1\cdots x_{t-1}, \text{Context}).$

The set $C_t$ is defined as all tokens that, when appended, keep the output prefix on a path toward a complete, syntactically valid specification-compliant sequence (Wang et al., 2024).
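
As a minimal sketch of the constrained argmax above, the snippet below restricts the selection to a given legal set. The token names, the toy log-probabilities, and the helper `constrained_argmax` are illustrative assumptions, not part of FANTASE; a real system would take the log-probabilities from the LLM.

```python
import math

def constrained_argmax(log_probs, legal_tokens):
    """Return the highest-scoring token among legal_tokens (the set C_t).

    log_probs: dict mapping token -> log-probability (stand-in for LLM output).
    """
    if not legal_tokens:
        raise ValueError("C_t is empty: no legal continuation")
    # Assumption of this sketch: tokens missing from log_probs score -inf.
    return max(legal_tokens, key=lambda v: log_probs.get(v, -math.inf))

# Toy distribution: the unconstrained argmax is a call that is not legal here.
log_probs = {"book_hotel": -0.2, "FindFlights": -1.6, "FindHotels": -2.3}
C_t = {"FindFlights", "FindHotels"}

unconstrained = max(log_probs, key=log_probs.get)
constrained = constrained_argmax(log_probs, C_t)
print(unconstrained, constrained)
```

The unconstrained choice falls outside $C_t$, while the constrained choice is the best-scoring legal token.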

3. Token Search Trie Construction and Traversal

All valid sequences (packages, function invocations, argument structures, and permitted values) are compactly encoded in a prefix trie, where each node represents a partial token sequence $s = x_1 \cdots x_{t-1}$. For each node, the set of legal next tokens $C_t$ is

$C_t = \{\ v \in V : s \cdot v \text{ is a valid prefix in the Trie} \ \}.$

This ensures that every extension of the output remains a prefix of a legal sequence. The Trie is typically constructed by parsing the current API documentation into allowable token-level expansions, supporting fast membership queries and dynamic updates as APIs evolve (Wang et al., 2024).
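
The trie lookup described above might be sketched as follows. This is an illustrative toy under stated assumptions: the tokenized calls, the `<str>` placeholder token, and the names `TrieNode`, `build_trie`, and `legal_next_tokens` are hypothetical, and the paper's actual parsing of API documentation into token sequences is not reproduced here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token -> TrieNode
        self.terminal = False  # True if the path to this node is a complete call

def build_trie(sequences):
    """Insert every valid token sequence into a prefix trie."""
    root = TrieNode()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode())
        node.terminal = True
    return root

def legal_next_tokens(root, prefix):
    """Return C_t for the given prefix, or an empty set if the prefix is invalid."""
    node = root
    for tok in prefix:
        node = node.children.get(tok)
        if node is None:
            return set()
    return set(node.children)

# Hypothetical tokenized API calls standing in for parsed documentation.
valid_calls = [
    ["FindFlights", "(", "origin", "=", "<str>", ")"],
    ["FindFlights", "(", "destination", "=", "<str>", ")"],
    ["FindHotels", "(", "city", "=", "<str>", ")"],
]
trie = build_trie(valid_calls)
print(legal_next_tokens(trie, ["FindFlights", "("]))
```

Because the trie is built only from the listed sequences, regenerating it from a revised list is all that is needed when the set of valid calls changes.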

4. Decoding Algorithmic Workflow

The SCD process can be executed in either greedy or beam search modes. The algorithm initializes a set of beams, each beam tracking the current output prefix, cumulative log-probability, and current Trie node. At each step:

For each beam, SCD queries the Trie node for $C_t$.
If $C_t$ is empty, the beam is terminated.
For greedy decoding, the LLM’s logit vector is masked to include only $C_t$ tokens, then the argmax in $C_t$ is selected.
For beam decoding, logits for all $v \in C_t$ are considered and the top-scoring beams are retained.
Steps with a unique allowed continuation require no LLM forward pass; the single valid token can be appended directly.

The process continues until all beams are complete (i.e., reach a terminal node in the Trie) or a maximum length is reached. SCD thus reduces the decoding search cost from $\mathcal{O}(T \cdot |V|)$ (unconstrained: $T$ steps, each scoring the full vocabulary) to $\mathcal{O}(T' \cdot |C|)$, where $T' \ll T$ because only branching points require an LLM forward pass, and $|C| \ll |V|$ because only the legal tokens are scored at each such step (Wang et al., 2024).

SCD Greedy/Beam Pseudocode

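procedure SCD_Decode(LLM, Trie, Context, max_len, B):
    # B = 1 corresponds to greedy decoding; B > 1 to beam search
    beams ← [{tokens: [], logprob: 0, node: Trie.root}]
    for t in 1..max_len do
        new_beams ← []
        for beam in beams do
            if beam.node is terminal: new_beams.append(beam); continue  # finished beam, carried over
            C_t ← beam.node.children_keys()  # valid next tokens
            if C_t is empty: continue  # dead end, beam is dropped
            if |C_t| = 1:
                scores ← {v: 0 for v in C_t}  # forced token, no LLM forward pass (score 0 is an assumption)
            else:
                scores ← LLM.log_probs(Context, beam.tokens) restricted to C_t  # masked scores
            for v in C_t do
                new_beams.append({tokens: beam.tokens + [v], logprob: beam.logprob + scores[v], node: beam.node.child(v)})
        beams ← prune_topB(new_beams, B)  # keep the B highest-scoring beams
        if all beams end in terminal node: break
    return beams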
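
For a runnable illustration of the greedy case, the sketch below uses a nested-dict trie and a toy scoring function in place of the LLM, and counts how many model calls are made. Everything here is an assumption for illustration (the names `build_trie`, `scd_greedy`, `toy_scores`, the token sequences, and the scores); it also simplifies by continuing past a terminal node whenever that node still has children.

```python
def build_trie(sequences):
    """Nested-dict trie; the key None marks a complete sequence."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[None] = True
    return root

def scd_greedy(score_fn, trie, max_len=32):
    """Greedy SCD. score_fn(prefix) -> dict token -> log-prob (stand-in for the LLM)."""
    node, tokens, forward_passes = trie, [], 0
    for _ in range(max_len):
        legal = [t for t in node if t is not None]
        if not legal:
            break                      # leaf reached: nothing left to append
        if len(legal) == 1:
            tok = legal[0]             # forced token: no model call needed
        else:
            scores = score_fn(tokens)  # one model call at a branching point
            forward_passes += 1
            tok = max(legal, key=lambda v: scores.get(v, float("-inf")))
        tokens.append(tok)
        node = node[tok]
    return tokens, forward_passes, None in node

def toy_scores(prefix):
    # Hypothetical preferences; "BookCar" is not in the trie and is never emitted.
    return {"BookCar": -0.1, "FindFlights": -0.9, "FindHotels": -1.5,
            "destination": -0.4, "origin": -1.2}

calls = [
    ["FindFlights", "(", "origin", "=", "<str>", ")"],
    ["FindFlights", "(", "destination", "=", "<str>", ")"],
    ["FindHotels", "(", "city", "=", "<str>", ")"],
]
tokens, passes, complete = scd_greedy(toy_scores, build_trie(calls))
print(tokens, passes, complete)
```

In this toy run, six tokens are produced with only two scoring calls, since the other four steps have a single legal continuation.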

5. Faithfulness, Complexity, and Empirical Results

SCD guarantees that all generated sequences are valid according to the Trie constraints—no syntactically or semantically invalid API call can be produced. Faithfulness is guaranteed by construction: every beam corresponds to a Trie path, and the Trie is constructed directly from the API documentation (Wang et al., 2024).

Key empirical results for the combined SCD and reranking approach (using Alpaca-13B as the LLM backbone) on the DSTC8 (few-shot) and API Bank (zero-shot) test sets demonstrate:

SCD improves API call accuracy to 42.33% (greedy) and 44.17% (beam) on DSTC8, and 56.64% (greedy) and 62.66% (beam) on API Bank.
Adding reranking on top of SCD increases this to 48.88% (DSTC8) and 64.41% (API Bank), competitive with GPT-3.5-turbo and GPT-4, which achieve 49.28–63.66%.
Inference speed is notably improved: for DSTC8, greedy decoding runs at 3.42s per sample (versus 5.32s baseline), and beam search at 6.33s (vs. 15.12s), yielding up to 2.4× faster inference.
SCD also greatly reduces sensitivity to prompt length and the presence/absence of API docs in the prompt (Wang et al., 2024).

Dataset	Baseline accuracy (greedy)	+SCD accuracy (greedy)	+SCD +Rerank accuracy	Beam speedup (×)
DSTC8	37.63%	42.33%	48.88%	2.39
API Bank	24.06%	56.64%	64.41%	2.25
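
As a quick consistency check, the short script below recomputes the DSTC8 speedups and the accuracy gains over the greedy baseline from the figures quoted above. It uses only numbers that appear in this section; the variable names are arbitrary.

```python
# DSTC8 seconds per sample, as quoted above: (baseline, with SCD).
dstc8_time = {"greedy": (5.32, 3.42), "beam": (15.12, 6.33)}
speedup = {mode: round(base / scd, 2) for mode, (base, scd) in dstc8_time.items()}

# Accuracy in percent: (baseline greedy, +SCD greedy, +SCD +Rerank).
accuracy = {
    "DSTC8": (37.63, 42.33, 48.88),
    "API Bank": (24.06, 56.64, 64.41),
}
# Gains in percentage points over the greedy baseline.
gains = {
    name: (round(scd - base, 2), round(rerank - base, 2))
    for name, (base, scd, rerank) in accuracy.items()
}
print(speedup)
print(gains)
```

The beam-search ratio on DSTC8 comes out at about 2.39, matching the table, and the greedy ratio at about 1.56.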

6. Integration with Output-Side Reranking

SCD is typically paired with a reranking mechanism to further ensure semantic faithfulness to user requests, disambiguating among multiple syntactically valid candidate calls. In FANTASE, a lightweight 125M-parameter RoBERTa-Base discriminator is trained to score candidates by match quality, given features such as the conversation, API documentation, and generated call. The final API call is then selected by maximizing

$S(c) = \log P_{\mathrm{LLM}}(c\mid\text{Context}) + \alpha f_{\text{dis}}(c,\text{Context}),$

where $\alpha$ is tuned on a validation set. Reranking with the SCD candidate set recovers correct API calls that the LLM’s top-probability output may miss, typically improving overall accuracy by 4–6 points (Wang et al., 2024).
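
The selection rule $S(c)$ can be sketched as below. The candidate calls, their log-probabilities, the discriminator scores, and the value of `alpha` are made-up numbers for illustration; in FANTASE the discriminator score would come from the trained RoBERTa-Base model and $\alpha$ from validation tuning.

```python
def rerank(candidates, alpha):
    """Pick the call maximizing S(c) = log P_LLM(c) + alpha * f_dis(c).

    candidates: list of (call, llm_logprob, discriminator_score) tuples.
    """
    return max(candidates, key=lambda c: c[1] + alpha * c[2])[0]

# Hypothetical SCD candidates with toy scores.
candidates = [
    ("FindFlights(destination='Paris')", -1.1, 0.20),
    ("FindHotels(city='Paris')", -1.4, 0.90),
]

best_llm_only = rerank(candidates, alpha=0.0)  # LLM likelihood alone
best_combined = rerank(candidates, alpha=1.0)  # likelihood plus discriminator
print(best_llm_only)
print(best_combined)
```

With `alpha=0.0` the LLM's most likely candidate wins; with `alpha=1.0` the discriminator score is enough to promote the second candidate, which is the kind of recovery the reranker is meant to provide.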

7. Significance, Capabilities, and Applications

The principal advantage of SCD lies in its ability to guarantee strict syntactic and semantic compliance with evolving external constraints, while maintaining efficient and scalable inference. Since SCD is an output-side procedure, updating to new or altered APIs requires only Trie regeneration—no retraining or costly context expansion. This situates SCD and related output-side optimization frameworks as well-suited to dynamically evolving, resource-limited, or high-integrity environments where API correctness, inference efficiency, and compact prompts are paramount.

A plausible implication is that the trie-based SCD paradigm can generalize efficiently to other constrained text generation domains beyond API calls, including but not limited to: program synthesis under grammar constraints, controlled dialog generation, or any sequence generation task requiring strict enforcement of formal external specifications (Wang et al., 2024).


References

Wang et al. (2024). FANTAstic SEquences and Where to Find Them: Faithful and Efficient API Call Generation through State-tracked Constrained Decoding and Reranking.
