Context Biasing Methods in ASR & Language Models

Updated 4 July 2026

Context Biasing Methods are techniques that adjust model predictions by integrating external context (e.g., contact names, domain-specific vocabularies) during inference.
They encompass shallow, decoding-time approaches and deep neural biasing methods that modify internal representations for improved recognition performance.
Applications span automatic speech recognition and autoregressive language models, offering trade-offs in latency, computational overhead, and robustness.

Context biasing methods are techniques that alter model predictions using external information available at inference time, such as contact names, callsigns, user-specific vocabularies, slide text, demographic prompts, or situational metadata. In automatic speech recognition (ASR), they are used to improve recognition of rare, personalized, or domain-specific words without retraining the base recognizer; in autoregressive LLMs, they provide a controllable mechanism for amplifying or attenuating the effect of context in next-token prediction; and in benchmark analysis, “context bias” can denote the extent to which a task is solvable from surrounding context alone rather than from genuine word-context interaction (Nigmatulina et al., 2023, He et al., 2024, Liu et al., 2021).

1. Problem setting and major method families

In contextual ASR, the recognition target is explicitly conditioned on both acoustics and a context list. One formulation writes the task as estimating $p(\mathbf{W} \mid \mathbf{X}, \mathbf{C})$ rather than only $p(\mathbf{W} \mid \mathbf{X})$, where $\mathbf{C}$ is a list of biasing words or phrases and may include a special no-bias entry (Huang et al., 2024). The motivating cases are consistent across the literature: contact names, app names, songs, callsigns, product names, named entities, medical terms, technical jargon, and other rare or newly introduced words are disproportionately important to users and disproportionately underrepresented in training data (Nigmatulina et al., 2023).

The papers distinguish two broad implementation styles. Shallow or decoding-time methods leave the base model largely unchanged and alter search-time scores, for example through contextual language-model interpolation, phrase graphs, shallow fusion, or score bonuses on partial matches. Neural or deep biasing methods instead inject context into hidden representations, typically through a context encoder and cross-attention over a bias list, thereby changing the internal state used for prediction (Huang et al., 2023, Andrusenko et al., 9 Aug 2025).

A classical decoding-time formulation is additive. In one ASR line, the decoder score is written as

$s(w|H) = s_G(w|H) + s_B(w|H),$

where $s_G(w|H)$ is the general language-model log-score and $s_B(w|H)$ is the contextual bias score (Kang et al., 2020). The central tradeoff is equally classical: biasing should strongly help when the utterance truly contains a contextual phrase, but it should not hallucinate contextual entities into anti-biasing or common-speech cases where the supplied list is irrelevant (Wang et al., 2023, Xu et al., 2023).
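
The additive formulation and its tradeoff can be illustrated with a minimal sketch. The candidate words, log-scores, and bonus value below are hypothetical and chosen only to show how $s_B$ can reorder candidates when the list is relevant and leave them untouched when it is not; this is not the scoring code of any cited system.

```python
import math

def biased_score(general_logp, bias_bonus, word):
    """Additive decoding score: s(w|H) = s_G(w|H) + s_B(w|H)."""
    return general_logp[word] + bias_bonus.get(word, 0.0)

def rank(general_logp, bias_bonus):
    """Rank candidate words by biased score, best first."""
    return sorted(general_logp,
                  key=lambda w: biased_score(general_logp, bias_bonus, w),
                  reverse=True)

# Hypothetical general-LM log-scores for three candidates after a history H.
general_logp = {"call": math.log(0.6), "cole": math.log(0.3), "kohl": math.log(0.1)}

relevant = rank(general_logp, {"kohl": 2.5})   # "kohl" is on the bias list
irrelevant = rank(general_logp, {"zoe": 2.5})  # list matches no candidate
```

With the relevant list the rare name moves to the top; with the irrelevant list the ranking is the same as for the unbiased model. A bonus that is too large would promote listed words even when the acoustics do not support them, which is the over-biasing risk described above.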

This suggests that “context biasing” is best understood as a family of conditioning mechanisms rather than a single algorithmic template. The same term covers graph rescoring, trie matching, n-gram shallow fusion, attention-based context encoders, streaming gates, post-hoc word spotting, and inference-time steering.

2. Decoding-time, graph-based, and search-based biasing

The most direct ASR methods intervene in decoding rather than in model parameters. Traditional contextual biasing often generates lattices and rescoring graphs, but this becomes problematic in online GPU decoding because partial hypotheses are emitted on the fly and lattices are not generated during streaming. In that setting, lattice rescoring adds latency and undermines the speed advantage of GPU decoding (Nigmatulina et al., 2023).

One explicit alternative is contextual biasing based on the Knuth–Morris–Pratt matching algorithm. Here, each bias phrase is treated as a pattern, and beam-search hypotheses maintain pattern-match states rather than traversing a WFST. The method defines a per-phrase potential

$f(P^b, i) = i \cdot \delta,$

aggregates it across phrases with a max, and assigns a token bonus equal to the increase in potential after a token extension. The paper states that this “simulates the classical approaches often implemented in the WFST framework,” but avoids FST construction and favors vectorization on TPUs. Its memory is linear in pattern length, and the paper gives $O(\bar{\gamma} K F B)$ complexity for shallow fusion and $O(\bar{\gamma} K B)$ for on-the-fly rescoring (Wang et al., 2023).
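
The potential-based bonus can be sketched in a few lines of Python. This is a simplified single-hypothesis illustration, not the paper's vectorized implementation: the function names are hypothetical, and the handling of a completed phrase (keep the bonus and restart matching) is an assumption made here for clarity.

```python
def failure_table(pattern):
    """Standard KMP failure function: fail[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def advance(pattern, fail, matched, token):
    """Return the matched-prefix length after reading one more token."""
    while matched > 0 and pattern[matched] != token:
        matched = fail[matched - 1]
    if pattern[matched] == token:
        matched += 1
    return matched

def token_bonuses(tokens, patterns, delta=1.0):
    """Per-token bonus = increase in max-over-phrases potential f(P, i) = i * delta."""
    fails = [failure_table(p) for p in patterns]
    states = [0] * len(patterns)
    potential = 0.0
    bonuses = []
    for tok in tokens:
        states = [advance(p, f, s, tok) for p, f, s in zip(patterns, fails, states)]
        new_potential = max(s * delta for s in states)
        bonuses.append(new_potential - potential)
        # Assumption: a fully matched phrase keeps its bonus and restarts matching.
        states = [0 if s == len(p) else s for p, s in zip(patterns, states)]
        potential = max(s * delta for s in states)
    return bonuses

phrases = [["new", "york"], ["new", "jersey"]]
full_match = token_bonuses(["call", "new", "york", "now"], phrases)
failed_match = token_bonuses(["new", "car"], phrases)
```

In the second call the bonus granted for the partial match on "new" is withdrawn when "car" breaks the match, which mirrors the behavior of failure arcs in WFST-based biasing.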

A related but GPU-oriented design is the phrase-boosting tree of TurboBias. TurboBias builds an Aho–Corasick-style prefix tree from tokenized phrases, assigns depth-aware boosting scores of the form $c_{0} \cdot \beta + \ln(\mathrm{depth})$ for transitions beyond depth one, and exposes a GPU query interface that returns contextual scores for hypothesis states over the vocabulary. These scores enter decoding through shallow fusion, that is, they are combined with the ASR model's scores during search. The paper reports that the framework supports CTC, transducer, and attention encoder-decoder (AED) decoding, including greedy and beam search, with only 2–5% RTFx overhead on average and approximately 5% slowdown even at 20K phrases (Andrusenko et al., 9 Aug 2025).

NGPU-LM occupies a nearby design point but uses a tensorized statistical n-gram LM rather than a phrase tree. Its core query is batched over the whole vocabulary, so a single call scores every candidate next token, which makes shallow-fusion-style biasing compatible with greedy decoding across transducers, CTC, and attention encoder-decoder models. The paper reports less than 7% computational overhead and states that, in high-resource out-of-domain settings, the method can eliminate more than 50% of the greedy-versus-beam-search accuracy gap while avoiding beam-search slowdown (Bataev et al., 28 May 2025).
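
A rough sketch of greedy decoding with shallow fusion over the whole vocabulary follows. It is a token-synchronous toy in pure Python with a hypothetical bigram table, a hypothetical fusion weight, and made-up ASR scores; it ignores blanks and does not reflect the tensorized GPU data structures of NGPU-LM.

```python
import math

# Hypothetical in-domain bigram probabilities; unseen pairs fall back to a floor.
BIGRAMS = {("<s>", "heart"): 0.5, ("<s>", "hard"): 0.1, ("heart", "attack"): 0.8}

def bigram_logp(prev, tok, floor=1e-3):
    return math.log(BIGRAMS.get((prev, tok), floor))

def greedy_with_fusion(asr_steps, lm_logp, weight, bos="<s>"):
    """At each step, add weight * LM log-prob to every vocabulary entry,
    then take the argmax (greedy decoding, no beam)."""
    history, output = bos, []
    for step in asr_steps:  # step: dict mapping token -> ASR log-prob
        fused = {tok: lp + weight * lm_logp(history, tok) for tok, lp in step.items()}
        best = max(fused, key=fused.get)
        output.append(best)
        history = best
    return output

asr_steps = [
    {"heart": math.log(0.45), "hard": math.log(0.55)},
    {"attack": math.log(0.90), "hard": math.log(0.05), "heart": math.log(0.05)},
]
without_lm = greedy_with_fusion(asr_steps, bigram_logp, weight=0.0)
with_lm = greedy_with_fusion(asr_steps, bigram_logp, weight=1.0)
```

Because the LM is queried for all tokens at every step, the fusion changes the greedy choice itself rather than only reranking completed beams.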

Across these works, the common principle is that contextual information is converted into search-time structure: failure links, trie states, n-gram states, or graph arcs. The practical difference lies in how much structure must be built, how efficiently it maps onto dense hardware, and whether the method can affect greedy decoding or only beam search.

3. Neural and model-integrated contextual biasing in ASR

Neural contextual biasing shifts the intervention point from search to hidden-state computation. In robust acoustic and semantic biasing for neural transducers, the contextual list is injected through two adapters: an Encoder Biasing Adapter that matches audio encoder states to character-level phrase keys for acoustic similarity, and a Pred-Net Biasing Adapter that uses pretrained language-model encodings of utterance history and bias phrases for semantic disambiguation. The full Char-PLM model combines character-based acoustic biasing with PLM-based semantic biasing and yields relative WER improvements over a subword-only baseline, including stronger gains on rare and zero-shot entities (Fu et al., 2023).

A systems-oriented variant is Deferred NAM. Its central change is to move lightweight phrase selection before the expensive context encoder, so that only the top-$k$ phrases are encoded with the fine-grained context module. The model adds phrase-level and wordpiece-level cross-entropy losses to supervise retrieval and context application. The paper reports that this yields up to a 16.1 times speedup, scales to 20K phrases with maximum pre-decoding delay under 33 ms, and achieves up to a 37.5% relative WER reduction over the baseline without the added losses and lightweight phrase selection pass (Wu et al., 2024).
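
The select-then-encode ordering can be sketched as follows. Both components are stand-ins: `char_overlap` is a hypothetical cheap scorer and `toy_encoder` a placeholder for the fine-grained context encoder; the real system uses learned embeddings and attention rather than character overlap.

```python
def char_overlap(query, phrase):
    """Cheap stand-in scorer: Jaccard overlap of the two character sets."""
    a, b = set(query), set(phrase)
    return len(a & b) / len(a | b)

def select_then_encode(query, phrases, cheap_score, expensive_encode, k):
    """Rank all phrases with the cheap scorer, then run the expensive
    encoder only on the k best."""
    shortlist = sorted(phrases, key=lambda p: cheap_score(query, p), reverse=True)[:k]
    return {p: expensive_encode(p) for p in shortlist}

encode_calls = []

def toy_encoder(phrase):
    """Placeholder for the expensive context encoder; records each call."""
    encode_calls.append(phrase)
    return [ord(ch) for ch in phrase]

encoded = select_then_encode(
    "katlin", ["peter", "caitlin", "moscow", "kathleen"],
    char_overlap, toy_encoder, k=2,
)
```

The expensive encoder runs twice instead of four times here; with a list of 20K phrases and a small $k$, the same ordering is what keeps pre-decoding delay bounded.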

Other neural methods expand the training signal itself. Contextual Text Injection converts unpaired text into speech-like internal representations, runs the same biasing retrieval and contextual attention used for real speech, and trains both the recognizer and the neural biasing component on paired speech-text data and text-only examples. The paper states that CTI with 100 billion text sentences achieves up to 43.3% relative WER reduction from a strong neural biasing model, and that CTI-MWER adds a further relative improvement of 23.5% (Meng et al., 2024). Early-context-injection work pursues a simpler architectural change: instead of injecting context only at the final encoder layer, it injects context at earlier layers as well, and combines this with alternative-spelling perturbation of rare words during training. On LibriSpeech, the combined method is reported to reduce rare-word error rate by 60% relative to no biasing and 25% relative to shallow fusion (Huang et al., 2024).

Neural contextual biasing has also been extended beyond short phrase lists. LCB-net targets long-context bias from synchronized slide text in audio-visual speech recognition. It uses a bi-encoder architecture with audio-context and context-audio cross-attention, plus a biasing prediction module trained with binary cross entropy to identify which contextual units are actually biased in the utterance. On SlideSpeech test data, the paper reports 9.4%, 9.1%, and 10.9% relative WER, U-WER, and B-WER reduction over the ASR model (Yu et al., 2024). A different alignment strategy appears in spike-triggered contextual biasing for Mandarin: contextual modules are applied only at CTC spike frames, enabling both implicit hidden-state biasing and explicit posterior biasing, and the system can be cascaded with shallow fusion for further gains (Huang et al., 2023).

These methods share the premise that contextual information should change representation learning itself, not merely rerank already formed hypotheses. The literature also shows that neural biasing is not monolithic: some variants emphasize retrieval supervision, some long-context modeling, some earlier encoder injection, and some text-only augmentation.

4. Streaming, GPU-native, and retraining-free specializations

A major engineering theme is preserving latency while still biasing partial hypotheses. One GPU-native approach integrates contextual biasing directly into the standard Kaldi online GPU WFST decoder. Instead of generating lattices, it precomputes the indices of arcs in the decoding graph that correspond to contextual words or sequences, and during decoding it applies a negative discount to the costs of those arcs. Because decoding costs are minimized, the negative offset makes the desired path more attractive. The same framework supports dynamic context switching by preloading multiple context-specific arc-index lists and selecting among them segment by segment. Runtime measurements show essentially unchanged RTFX: 26.062 with no biasing, 26.061 for endpoint-rescored sequences, and 26.065 for partial-hypothesis biasing (Nigmatulina et al., 2023).
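
The arc-index idea can be illustrated with a toy sketch. The `Arc` class, the two-arc graph, the context names, and the discount value are all hypothetical; a real Kaldi graph and its GPU decoder are far more involved, and this only shows the precompute-then-offset pattern and context switching by swapping index sets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    label: str
    cost: float  # decoding cost; lower is better

def context_arc_indices(arcs, context_words):
    """Precompute the indices of arcs whose label is a contextual word."""
    return frozenset(i for i, arc in enumerate(arcs) if arc.label in context_words)

def biased_cost(arcs, index, active, discount):
    """Arc cost with the discount subtracted if the arc is in the active context."""
    return arcs[index].cost - (discount if index in active else 0.0)

def best_label(arcs, active, discount=2.0):
    """Pick the cheapest arc under the currently active context."""
    best = min(range(len(arcs)), key=lambda i: biased_cost(arcs, i, active, discount))
    return arcs[best].label

arcs = [Arc("left", 3.0), Arc("lufthansa", 4.0)]

# Preloaded per-context index sets; switching context is a dictionary lookup.
contexts = {
    "aviation": context_arc_indices(arcs, {"lufthansa"}),
    "none": frozenset(),
}
aviation_choice = best_label(arcs, contexts["aviation"])
default_choice = best_label(arcs, contexts["none"])
```

Since the index sets are built once ahead of time, the per-arc work at decoding time is a membership test and a subtraction, which is consistent with the near-identical runtime figures reported above.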

Streaming robustness also motivates adaptive gating. In adaptive contextual biasing for transducer-based streaming ASR, the concern is that always-on biasing can hallucinate contextual phrases into common speech. The proposed solution adds an Entity Detector on top of a Context-Aware Transformer Transducer and switches the context list on only when the detector predicts that a contextual phrase is present. Two variants are described: P-ED, which uses predictor-side information, and EP-ED, which also uses biased encoder embeddings. The paper reports up to 6.7% relative WER reduction on LibriSpeech common-case scenarios, up to 20.7% relative CER reduction on an internal Mandarin voice-assistant dataset, mitigation of up to 96.7% of the relative WER increase for common cases, and only negligible RTF increase (Xu et al., 2023).
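
The gating logic reduces to a small switch, sketched below. The detector probability and threshold are hypothetical inputs; in the paper the detector is a trained module inside the transducer, not an external score passed in by hand.

```python
def gated_bias_list(bias_list, detector_prob, threshold=0.5):
    """Pass the bias list through only when the (hypothetical) entity
    detector's probability reaches the threshold; otherwise disable biasing."""
    return list(bias_list) if detector_prob >= threshold else []

contacts = ["kohl", "zoe"]
active_for_command = gated_bias_list(contacts, detector_prob=0.92)
active_for_small_talk = gated_bias_list(contacts, detector_prob=0.08)
```

With an empty list the recognizer behaves like the unbiased model, which is how gating limits degradation on common speech.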

Another retraining-light direction avoids modifying the decoder at all. CTC-based Word Spotter builds a compact context graph from a trie plus CTC topology, scores hypotheses directly on CTC log-probabilities, and then replaces overlapping greedy output spans with better-scoring spotted context candidates. On the GTC benchmark, the paper reports that CTC-WS improves over pyctcdecode with context biasing by +0.08 F-score and −1.58 absolute WER for CTC, and that merging CTC-WS with greedy Transducer output gives the best reported transducer result in its table: F-score 0.87 and WER 9.90%, with much lower decoding time than beam-search baselines (Andrusenko et al., 2024).
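
The replacement step can be sketched as an interval merge. The tuple layout, the frame spans, the log-scores, and the acceptance rule (a candidate must beat the summed score of the greedy words it overlaps) are assumptions made for illustration and are not taken from the CTC-WS implementation.

```python
def merge_spotted(greedy, spotted):
    """greedy, spotted: lists of (word, start_frame, end_frame, log_score).
    A spotted candidate replaces the greedy words it overlaps in time when
    its score is higher than their summed score (assumed rule)."""
    result = list(greedy)
    for cand in spotted:
        overlapping = [w for w in result if w[1] < cand[2] and cand[1] < w[2]]
        if overlapping and cand[3] > sum(w[3] for w in overlapping):
            result = [w for w in result if w not in overlapping]
            result.append(cand)
    return [w[0] for w in sorted(result, key=lambda w: w[1])]

greedy = [("call", 0, 10, -0.5), ("in", 10, 14, -2.0),
          ("video", 14, 30, -2.5), ("now", 30, 40, -0.4)]
spotted = [("nvidia", 10, 30, -3.0),  # beats "in" + "video" (-4.5)
           ("cool", 0, 10, -1.5)]     # does not beat "call" (-0.5)

merged = merge_spotted(greedy, spotted)
```

Because the merge operates on the greedy output and the spotted candidates only, the decoder itself is left untouched.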

WCTC-Biasing also avoids retraining, but it intervenes earlier in the encoder. It uses wildcard CTC on intermediate-layer posteriors to detect keywords tolerant of ambiguous or partial matches, then injects the detected keyword signal into later encoder layers through self-conditioning-style inter-layer biasing. The method is explicitly described as requiring no retraining and no TTS system, and the paper reports a 29% improvement in F1 score for unknown words on TEDxJP-10K (Nakagome et al., 2 Jun 2025). A more specialized correction method addresses pronunciation–orthography mismatch in end-to-end ASR: the system uses a substitution error as a proxy to identify the relevant context phrase but boosts the correct target phrase token by token during beam search. In the reported experiments, this context biasing plus replacement approach yields up to 11% relative improvement in biased word error rate while maintaining competitive overall WER (Huber et al., 23 Jun 2025).

This part of the literature makes clear that low-latency biasing is not synonymous with one algorithm. It includes GPU arc editing, context-list gating, CTC spotting and replacement, wildcard keyword detection in intermediate layers, and interactive correction mechanisms for specific failure modes.

5. Context steering in autoregressive LLMs

In LLMs, context biasing appears as inference-time control over how strongly a context affects next-token probabilities. Context Steering defines a contextual influence function by comparing two forward passes of the same autoregressive model at each decoding step, one conditioned on context $\mathcal{C}$ and prompt $\mathcal{P}$, and one conditioned on the prompt alone:

$\mathcal{F}_{\mathcal{C},\mathcal{P}}(x_i) = \mathrm{LLM}(x_i \mid \mathcal{C}, \mathcal{P}) - \mathrm{LLM}(x_i \mid \emptyset, \mathcal{P}),$

where $\mathrm{LLM}(x_i \mid \cdot)$ denotes the model's score for token $x_i$. The steered score is then

$\mathrm{CoS}_{\lambda}(x_i \mid \mathcal{C}, \mathcal{P}) = \mathrm{LLM}(x_i \mid \mathcal{C}, \mathcal{P}) + \lambda \, \mathcal{F}_{\mathcal{C},\mathcal{P}}(x_i),$

and the final next-token distribution is the softmax of this score (He et al., 2024).

The scalar $\lambda$ is an explicit control parameter. The paper emphasizes that $\lambda = 0$ recovers ordinary context-conditioned decoding, $\lambda = -1$ collapses to the context-free prediction, positive values amplify personalization, and negative values attenuate contextual influence. Because the method works in logit space and compares the same model with and without context, it differs from prompt engineering, fine-tuning, RLHF, activation addition, and contrastive decoding. It also requires no training, no parameter updates, and no additional annotated dataset, provided that log probabilities are accessible (He et al., 2024).
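
A minimal numeric sketch of the steering rule follows, assuming the steered score adds $\lambda$ times the with-versus-without-context difference to the with-context score. The three-token logit vectors are made up, and in practice they would come from two forward passes of the same model at each decoding step.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def steer(with_context, without_context, lam):
    """Steered score = with-context score + lam * (with-context - without-context),
    followed by a softmax to obtain the next-token distribution."""
    steered = [wc + lam * (wc - nc) for wc, nc in zip(with_context, without_context)]
    return softmax(steered)

# Hypothetical next-token logits over a three-token vocabulary.
with_ctx = [2.0, 1.0, 0.0]
without_ctx = [1.0, 1.0, 1.0]

plain = steer(with_ctx, without_ctx, lam=0.0)          # ordinary conditioned decoding
context_free = steer(with_ctx, without_ctx, lam=-1.0)  # context removed
amplified = steer(with_ctx, without_ctx, lam=2.0)      # context effect exaggerated
```

Setting `lam` to 0 or -1 reproduces the two endpoints named in the text, while larger positive values push probability mass toward the tokens the context already favors.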

Empirically, Context Steering is evaluated on personalized movie summarization and recommendation, bias mitigation on BBQ and the Implicit Association Test, and a Bayesian generative extension that treats the steered model as a forward distribution for inferring likely hidden contexts or steering strengths from generated text. The paper reports a significant Spearman correlation between human personalization scores and the steering strength $\lambda$ in a user study, and also notes an important failure mode: very high $\lambda$ values, especially with problematic demographic contexts, can produce unstable or harmful outputs (He et al., 2024).

The method broadens the notion of context biasing beyond ASR. Here the object being biased is not a beam-search path or a phrase lattice, but the token distribution itself, and the control variable is a same-model with-versus-without-context contrast.

6. Robustness, evaluation, and contextual reliability

Robustness to irrelevant context is a recurring concern. A fast unsupervised ASR method addresses this by precomputing a vocabulary-wide bias score from a class-based language model trained on the corpus, rather than constructing a contextual LM at inference. Its bias function is defined so that rare words receive stronger boosts than common words. The paper reports WER improvement from 16.65% to 10.13% or 10.12% with relevant context, but only 0.20% or 0.33% absolute degradation even with 10,000 distractors, and only 0.03% absolute increase for the expansion method with 5,000 classes and 10,000 distractors (Kang et al., 2020).

Outside ASR, “context bias” is also a measurement problem. In lexical semantic datasets, the term refers to how much a task can be solved from context alone. The proposed normalized measures are computed from four quantities: context-only, word-only, label-only, and full-input performance. The reported analysis finds that WiC-style tasks and WSD show strong context bias, retrieval-based tasks show strong target-word bias, and humans are generally less extreme than BERT on the same masked baselines (Liu et al., 2021).

Bias-benchmark evaluation for LLMs introduces a related but distinct notion: contextual reliability of benchmark items. COBIAS measures whether a stereotyped statement remains stable under plausible contextual elaborations, using the variance in model behavior across added contexts. The paper reports that COBIAS aligns with human judgment, as measured by Spearman's rank correlation, and uses the framework to compare datasets such as WinoGender, WinoBias, CrowS-Pairs, StereoSet-intrasentence, and RedditBias (Govil et al., 2024).

A persistent misconception is that adding context is uniformly beneficial. The cited work does not support that view. Always-on biasing can degrade common-word recognition, extreme steering parameters can destabilize generation, large distractor lists must often be filtered or gated, and benchmark statements themselves may be unreliable when stripped of context. This suggests that the central technical question is not merely how to inject context, but how to regulate, verify, and evaluate its influence.