Trie-Based Beam Search for ASR & LLMs

Updated 13 April 2026
  • Trie-based beam search is a decoding algorithm that uses prefix tries to efficiently integrate token-level constraints and support rare word recognition.
  • It integrates with beam search by tracking trie nodes for each hypothesis, applying dynamic rewards, and sharing KV caches to significantly reduce memory consumption.
  • Empirical results show reduced word error rates in ASR and up to 92% memory reduction in LLM decoding, making it both scalable and effective.

Trie-based beam search is a family of decoding algorithms that use prefix tries to integrate token-level constraints or hypotheses into sequence generation with neural models, particularly for automatic speech recognition (ASR) and large language models (LLMs). By organizing target subword or word sequences as compact prefix trees, it lets the decoder track partial matches, apply dynamic rewards, and share computational state among competing beams, thereby improving rare word recall, reducing memory consumption, and increasing search efficiency.

1. Prefix Trie Construction

The foundational data structure in trie-based beam search is the prefix trie: a rooted, labeled, directed acyclic graph built over a vocabulary $V$. Given a set of target sequences (e.g., rare words or contextual phrases) represented as token lists, each sequence is inserted as a unique path from the root. For ASR contextual biasing, each "hotword" or rare word $w$ is embedded into utterance templates and tokenized (e.g., into BPE units) before trie insertion. Variants are extracted via automatic or synthetic generation pipelines (multi-TTS, Whisper ASR decoding, syllabic filtering) (Liu et al., 25 Aug 2025).

Formally, a trie $T = (N, r, E)$ is defined by:

  • $N$: set of nodes
  • $r \in N$: the root node
  • $E \subseteq N \times V \times N$: labeled edges for token extensions
  • Each node $n \in N$ stores membership flags and (optionally) accumulated scores or terminal status.

This data structure supports multi-pronunciation hotwords, OOVs, and subword-based expansions (Liu et al., 25 Aug 2025, Kwok et al., 11 Sep 2025).
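Variant insertion can be sketched as follows. This is a minimal illustration, not the cited authors' code: the names `TrieNode` and `insert_variant`, the `hotword` field, and the integer token IDs are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    # Outgoing labeled edges: token ID -> child node.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)
    # Terminal status: True when a complete variant ends at this node.
    is_terminal: bool = False
    # Canonical hotword for a terminal node (hypothetical field).
    hotword: Optional[str] = None


def insert_variant(root: TrieNode, tokens: List[int], hotword: str) -> None:
    """Insert one tokenized variant as a path starting at the root."""
    node = root
    for tok in tokens:
        # Reuse the existing child if the prefix is already present.
        if tok not in node.children:
            node.children[tok] = TrieNode()
        node = node.children[tok]
    node.is_terminal = True
    node.hotword = hotword


root = TrieNode()
# Two made-up tokenizations (e.g., pronunciation variants) of one hotword.
insert_variant(root, [17, 42, 8], "hotword_a")
insert_variant(root, [17, 42, 9], "hotword_a")
```

Because both variants share the prefix `[17, 42]`, only their final tokens create new nodes, which is what keeps the trie compact when many variants overlap.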

2. Integration with Beam Search

During autoregressive generation, each beam hypothesis maintains (1) its current trie node and (2) the accumulated match state. On each token expansion:

  • If the token continues a valid trie prefix, the node pointer advances.
  • If not, the pointer resets to root (or null), dropping partial-matching rewards.

The standard cost for a hypothesis extended by token $y_t$ is modified as:

$$S_H(y_{1:t}) = S_H(y_{1:t-1}) + S_W(y_t) - \rho_t$$

where $S_W(y_t) = -\log P(y_t \mid y_{1:t-1}, X)$ is the negative log-probability under the base model, and $\rho_t$ is a reward that is subtracted from the cost when the trie expansion matches (Liu et al., 25 Aug 2025).

Beams are maintained as tuples of score, token sequence, current trie node, and accumulated reward. After each step, only the best-scoring hypotheses (those with the lowest cost), up to the beam width, are retained.
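One expansion step might look like the sketch below. It assumes a constant per-token reward `rho`, treats the score as a cost (lower is better), and drops the reward of an unfinished match by adding it back to the cost. `Node`, `build_trie`, and `beam_step` are hypothetical names, and locking in the reward once a terminal node is reached is a design choice of this sketch rather than something the cited papers specify.

```python
import heapq
import math
from typing import Dict, List, Tuple


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}
        self.terminal = False


def build_trie(sequences: List[List[int]]) -> Node:
    root = Node()
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, Node())
        node.terminal = True
    return root


# Hypothesis: (cost, tokens, current trie node, reward held by the open match)
Hyp = Tuple[float, List[int], Node, float]


def beam_step(beams: List[Hyp], log_probs: List[Dict[int, float]],
              root: Node, rho: float, beam_width: int) -> List[Hyp]:
    """Expand every hypothesis by one token and keep the lowest-cost ones."""
    candidates: List[Hyp] = []
    for (cost, tokens, node, pending), lp in zip(beams, log_probs):
        for tok, logp in lp.items():
            new_cost = cost - logp            # add S_W(y_t) = -log P(y_t | ...)
            child = node.children.get(tok)
            if child is not None:             # token continues a trie prefix
                new_cost -= rho               # subtract the reward rho_t
                # A completed word keeps its reward; otherwise keep tracking it.
                new_pending = 0.0 if child.terminal else pending + rho
                nxt = child
            else:                             # mismatch: reset pointer to root
                new_cost += pending           # drop partial-match rewards
                new_pending, nxt = 0.0, root
            candidates.append((new_cost, tokens + [tok], nxt, new_pending))
    return heapq.nsmallest(beam_width, candidates, key=lambda h: h[0])


root = build_trie([[1, 2]])                   # one made-up two-token hotword
beams = [(0.0, [], root, 0.0)]
beams = beam_step(beams, [{1: math.log(0.5), 3: math.log(0.5)}],
                  root, rho=1.0, beam_width=2)
```

After this step the hypothesis ending in token `1` holds a pending reward of `1.0`; if its next token is not `2`, that amount is added back to its cost.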

3. Reward Schemes, Shallow Fusion, and Revocation

The trie enables dynamic biasing by assigning per-token (or final-node) rewards to beams likely to generate targeted rare words or phrases. This "shallow fusion" approach is widespread in contextual ASR. Its formulation involves an indicator of whether the prefix matches a trie path, a positive bias, and a penalty (Kwok et al., 11 Sep 2025).

A key challenge is reward revocation: if a hypothesis fails to complete a rare word, any previously granted bonus must be withdrawn. This introduces extra bookkeeping and can increase beam width requirements, impacting decoding speed.

K-step lookahead variants address this by modifying the neural model to jointly predict, with an auxiliary output head, the next $K$ tokens. Trie bonuses are then given only if the model's lookahead predictions corroborate that a prefix will lead to a rare word completion, eliminating the need for revocation logic (Kwok et al., 11 Sep 2025).

The bonus is applied only if the lookahead predictions match the expected rare word suffix. This improves efficiency and makes greedy decoding safe, with empirical WER reductions (Kwok et al., 11 Sep 2025).
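The gating idea can be sketched as a function that grants the bonus only when the candidate token and the lookahead predictions stay on a trie path. This is an illustrative reading, not the cited model: `build_trie`, `lookahead_bonus`, and the `END` sentinel are hypothetical, the lookahead tokens are assumed to come from the auxiliary head, and granting the bonus when all predictions remain on a path without yet reaching the end of the word is an assumption.

```python
from typing import Dict, List

END = -1  # sentinel key: a rare word ends here (assumes real token IDs are >= 0)


def build_trie(words: List[List[int]]) -> Dict:
    root: Dict = {}
    for word in words:
        node = root
        for tok in word:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def lookahead_bonus(node: Dict, token: int, lookahead: List[int],
                    bonus: float) -> float:
    """Return `bonus` only if `token` plus the predicted tokens follow a trie path."""
    nxt = node.get(token)
    if nxt is None:                 # candidate token is not on any trie path
        return 0.0
    for tok in lookahead:
        if END in nxt:              # rare word already completed
            return bonus
        nxt = nxt.get(tok)
        if nxt is None:             # prediction leaves the trie: no bonus
            return 0.0
    return bonus


root = build_trie([[5, 6, 7]])      # one made-up three-token rare word
```

Here `lookahead_bonus(root, 5, [6, 7], 2.0)` yields the bonus, while `lookahead_bonus(root, 5, [6, 9], 2.0)` yields zero, so nothing has to be revoked later.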

4. Memory-Efficient Decoding and KV Cache Sharing in LLMs

Trie-based beam search extends beyond context biasing to efficient decoding in large Transformer-based LLMs. When decoding with a batch beam search, each hypothesis (beam) naively maintains its own key–value (KV) cache, so memory grows with the product of beam width and sequence length. Trie-based schemes recognize that most hypotheses share long prefixes; only unique prefix nodes require dedicated storage (Chan et al., 31 Jan 2025).

Algorithmically, the trie is updated at each step by merging states with shared prefixes. Only unique nodes (prefixes) have distinct KV cache entries. Causal attention masks are constructed so that each token attends only to ancestors along its root-to-leaf branch, preserving branch isolation (Chan et al., 31 Jan 2025).
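The prefix merging and branch-isolating mask can be illustrated with plain Python lists. This sketch only counts cache entries and builds a boolean mask; it does not touch real KV tensors, and `build_prefix_trie` and `ancestor_mask` are hypothetical names.

```python
from typing import Dict, List, Tuple


def build_prefix_trie(beams: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Merge beams into a trie; return (parent index, token) for every node.

    Node 0 is a root placeholder; every other node would own one KV entry.
    """
    parents, tokens = [-1], [-1]
    children: List[Dict[int, int]] = [{}]    # per node: token -> child index
    for beam in beams:
        node = 0
        for tok in beam:
            if tok not in children[node]:
                children[node][tok] = len(parents)
                parents.append(node)
                tokens.append(tok)
                children.append({})
            node = children[node][tok]
    return parents, tokens


def ancestor_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True iff node j is node i itself or one of its ancestors."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


beams = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]    # made-up token IDs
parents, tokens = build_prefix_trie(beams)
mask = ancestor_mask(parents)
unique_entries = len(parents) - 1            # exclude the root placeholder
batch_entries = sum(len(b) for b in beams)   # one entry per beam per token
```

For these three beams the trie needs 6 entries where per-beam caches would need 9, and the node for token `4` can attend to the nodes for `1` and `2` but not to the sibling node for `3`.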

Quantitative memory and speed comparisons:

| Dataset | Beam Width | Batch Mem/Tok (MB) | Trie Mem/Tok (MB) | Speed Δ (%) |
|---|---|---|---|---|
| CNN/DM | 15 | 19.95 | 1.55 (–92.2%) | +42.1 |
| QMSUM | 9 | 13.21 | 1.56 (–88.2%) | +117.3 |
| HumanEval | 15 | 16.16 | 3.31 (–79.5%) | +1.3 |

Trie-based decoding can reduce KV-cache usage by 60–90% and match greedy-decoding memory footprints, without loss of output quality (Chan et al., 31 Jan 2025).
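The per-dataset reductions reported above follow directly from the two memory columns. A small arithmetic check, using only the figures from the table:

```python
# (batch MB per token, trie MB per token), copied from the table.
rows = {
    "CNN/DM": (19.95, 1.55),
    "QMSUM": (13.21, 1.56),
    "HumanEval": (16.16, 3.31),
}

# Relative reduction in percent: 100 * (1 - trie / batch).
reductions = {
    name: round(100 * (1 - trie / batch), 1)
    for name, (batch, trie) in rows.items()
}
```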

5. Computational Complexity and Implementation Attributes

Trie construction costs time linear in the total number of tokens inserted (the sum over all variant lengths). Decoding overhead comes from the per-step hypothesis/token expansion. Each child lookup in the trie is typically $O(1)$ (hash or array). For LLM decoding, additional overheads include causal mask updates, whose per-step cost depends on trie depth, and garbage collection of pruned trie branches at intervals (Chan et al., 31 Jan 2025, Liu et al., 25 Aug 2025).

In contextual ASR, beam states retain mappings from trie terminals to canonical hotwords, enabling post-hoc variant-to-hotword transcription (Liu et al., 25 Aug 2025).
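A terminal-to-hotword mapping of this kind can be sketched as a scan over a decoded token sequence. The dictionary-based node layout, the function names, and the example tokens and hotword are all invented for illustration, and the longest-match policy is an assumption of the sketch.

```python
from typing import Dict, List, Tuple


def new_node() -> Dict:
    return {"children": {}, "hotword": None}


def insert(root: Dict, variant: List[str], hotword: str) -> None:
    node = root
    for tok in variant:
        node = node["children"].setdefault(tok, new_node())
    node["hotword"] = hotword        # terminal node -> canonical hotword


def find_hotwords(tokens: List[str], root: Dict) -> List[Tuple[int, int, str]]:
    """Return (start, end, hotword) for each longest trie match in `tokens`."""
    matches, i = [], 0
    while i < len(tokens):
        node, best = root, None
        for j in range(i, len(tokens)):
            node = node["children"].get(tokens[j])
            if node is None:
                break
            if node["hotword"] is not None:
                best = (i, j + 1, node["hotword"])   # remember the longest match
        if best is not None:
            matches.append(best)
            i = best[1]              # continue after the matched span
        else:
            i += 1
    return matches


root = new_node()
insert(root, ["_ny", "oo", "yen"], "Nguyen")   # made-up spelling variant
insert(root, ["_ng", "uyen"], "Nguyen")        # made-up canonical tokenization
matches = find_hotwords(["_call", "_ny", "oo", "yen", "_now"], root)
```

Each match gives the token span of a recognized variant together with the canonical hotword that should replace it in the transcript.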

For K-step look-ahead models, additional complexity is introduced via secondary prediction heads but remains negligible compared to full sequence modeling (Kwok et al., 11 Sep 2025).

6. Empirical Performance in Contextual Biasing and Language Generation

Key results establish that trie-based beam search:

  • Cuts biased WER (rare word errors) in half on LibriSpeech ASR test sets, with overall WER reduced by 15–20%. Unbiased WER remains nearly unchanged, indicating a negligible increase in false positives (Liu et al., 25 Aug 2025).
  • In contextual ASR on NSC-Part-2, a K-step adaptation reduces WER from 30.86% (vanilla) to 12.19% with 10 hours of synthetic data, outperforming naive biasing in the high-distractor regime (Kwok et al., 11 Sep 2025).
  • In LLMs, memory per token drops by up to 92% at high beam widths, decoding speed is preserved or increased, and outputs are identical in quality to standard beams (Chan et al., 31 Jan 2025).

These results demonstrate that trie-based beam search is both a scalable context-biasing strategy for ASR and a critical infrastructure for efficient LLM inference.

7. Limitations, Edge Cases, and Practical Considerations

The performance of trie-based beam search depends on the amount of prefix sharing among active beams. In the worst case (maximal divergence at every step), no prefixes are shared and memory usage degrades to that of standard batch decoding. Mask construction overhead and garbage collection frequency can be tuned to balance latency and memory reclamation. Extra CPU memory is required to maintain trie structures, though this is marginal relative to GPU buffer sizes (Chan et al., 31 Jan 2025).
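The dependence on prefix sharing can be made concrete by counting unique prefixes, each of which would need one cache entry. The helper name and the toy beams below are illustrative only.

```python
from typing import List


def unique_prefix_nodes(beams: List[List[int]]) -> int:
    """Number of distinct non-empty prefixes across all beams."""
    return len({tuple(b[:i]) for b in beams for i in range(1, len(b) + 1)})


# Four beams of length 5 that differ only in their last token.
shared = [[0, 1, 2, 3, 10 + k] for k in range(4)]
# Four beams of length 5 that diverge from the very first token.
divergent = [[k] * 5 for k in range(4)]

batch_entries = 4 * 5                            # per-beam caches: width x length
shared_entries = unique_prefix_nodes(shared)     # 4 shared + 4 leaf nodes = 8
divergent_entries = unique_prefix_nodes(divergent)  # nothing shared = 20
```

With heavy sharing the trie stores 8 entries instead of 20; with full divergence it stores all 20, the same as batch decoding.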

In contextual biasing, reward schedule design (per-token vs. final-node), hotword tokenization, and match specificity affect bias side effects (e.g., false positives, incomplete hotword insertions) (Liu et al., 25 Aug 2025, Kwok et al., 11 Sep 2025).

A plausible implication is that trie-based beam search generalizes to other conditional decoding settings—such as structured generation or open-domain entity recall—where token-level constraints can be compactly encoded in prefix automata.

