
Lattice Rescoring Protocols in ASR

Updated 21 December 2025
  • Lattice rescoring protocols are methods that refine automatic speech recognition by re-ranking decoding lattices using advanced language models, balancing efficiency and improved accuracy.
  • They integrate traditional n-gram approaches with neural models like LSTM and Transformer via algorithms such as push-forward, achieving significant WER reductions with low resource usage.
  • Recent innovations include contextual extensions, parallel GPU rescoring, and discriminative training techniques that optimize performance while reducing computational overhead.

Lattice rescoring protocols constitute a critical component of modern automatic speech recognition (ASR) pipelines, enabling improved hypothesis selection by incorporating more flexible or powerful LMs in a second pass over decoding lattices. These protocols allow integration of advanced neural or context-enhanced LMs, minimize computational and memory costs during first-pass decoding, and facilitate rich use of supplementary data including large-scale text corpora or external context. The following article surveys and synthesizes major algorithmic frameworks, mathematical formulations, and empirical trade-offs underpinning lattice rescoring, with particular emphasis on recent innovations in low-resource ASR, neural network integration, contextualization, and computational optimization.

1. Foundations of Lattice Rescoring

Lattice rescoring is a two-pass inference strategy historically motivated by the need to balance efficiency and expressivity in ASR. In the first pass, a fast decoder (often an n-gram LM composed with a weighted finite-state transducer, WFST) produces a word lattice, a directed acyclic graph (DAG) where paths correspond to alternative hypotheses. Each arc is typically annotated with an acoustic likelihood and an LM (usually n-gram) score. In the second pass, the lattice arcs are re-evaluated using a more expressive LM, such as an LSTM or Transformer, and a new best path is selected. The general objective is

$$\hat{W} = \arg\max_W P(O \mid W)\, P(W)$$

where $O$ is the observed acoustic sequence and $P(W)$ is updated in the second pass via rescoring. The push-forward algorithm and its variants are canonical mechanisms for integrating neural LMs into lattice rescoring, maintaining per-node histories for efficient dynamic programming over the lattice structure (Kumar et al., 2017).
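
As a concrete illustration of the second pass, the sketch below selects the best path of a rescored lattice by dynamic programming over the DAG, combining each arc's acoustic log-likelihood with a replacement LM log-probability. The arc tuple format, topological node numbering, and `lm_weight` value are assumptions made for this example, not details taken from the cited papers.

```python
from collections import defaultdict

def best_path_after_rescoring(arcs, start, end, lm_weight=0.8):
    """Pick the lattice path maximizing the combined acoustic + rescored-LM
    log-probability (minimal second-pass selection sketch).

    arcs: list of (src, dst, word, ac_logp, lm_logp) tuples, where lm_logp is
    the second-pass LM log-probability of the arc's word. Nodes are assumed to
    be numbered in topological order (src < dst), as in Kaldi-style lattices.
    """
    out = defaultdict(list)
    for src, dst, word, ac_logp, lm_logp in arcs:
        out[src].append((dst, word, ac_logp + lm_weight * lm_logp))

    best = {start: (0.0, [])}            # node -> (score, word sequence)
    for node in sorted(out):             # topological because src < dst
        if node not in best:
            continue
        score, words = best[node]
        for dst, word, arc_score in out[node]:
            cand = (score + arc_score, words + [word])
            if dst not in best or cand[0] > best[dst][0]:
                best[dst] = cand
    return best[end]                     # (total score, best word sequence)

# Tiny example: the rescoring LM prefers "see" over "sea".
arcs = [(0, 1, "i", -1.0, -0.5), (1, 2, "sea", -1.0, -3.0),
        (1, 2, "see", -1.2, -0.7), (2, 3, "you", -0.8, -0.4)]
print(best_path_after_rescoring(arcs, start=0, end=3))
```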

2. Minimally Augmented Lattice Rescoring in Low-Resource Settings

Significant challenges arise in low-resource ASR, where the baseline LM, typically trained on scarce transcripts, cannot produce lattices that include plausible out-of-vocabulary (OOV) words. "Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR" introduces a two-pass protocol that efficiently injects missing unigram probabilities for OOV terms derived from a large external text corpus, such as Wikipedia (Murthy et al., 16 Mar 2024). The workflow is as follows:

  1. OOT-word selection:
    • Let $V_{\mathrm{base}}$ be the baseline LM vocabulary and $V_{\mathrm{wiki}}$ the external text vocabulary. The out-of-train set is $\mathrm{OOT} = V_{\mathrm{wiki}} \setminus V_{\mathrm{base}}$.
    • Unigram counts $c_{\mathrm{wiki}}(w)$ are extracted for $w \in \mathrm{OOT}$.
  2. OWALM construction:
    • Interpolate the baseline $n$-gram LM and the OOT unigram LM:

    $$P_{\mathrm{OWALM}}(w) = \lambda\, P_{\mathrm{base}}(w) + (1-\lambda)\, \frac{c_{\mathrm{wiki}}(w)}{\sum_{u \in \mathrm{OOT}} c_{\mathrm{wiki}}(u)}$$

    with $\lambda$ tuned on held-out data (steps 1–2 are sketched in code after this list).

  3. Initial decoding:

    • ASR decoding proceeds using $G_{\mathrm{OWALM}}$, yielding richer lattices containing more alternative hypotheses, including previously OOV words.
  4. Rescoring:

    • The lattices are rescored using a full n-gram LM trained on all available data (baseline + Wikipedia).
    • A linear score interpolation is applied:

    $$\mathrm{score}_{\mathrm{rescore}}(h) = \log P_{\mathrm{acoustic}}(h) + \alpha \log P_{\mathrm{full}}(h) + (1-\alpha) \log P_{\mathrm{OWALM}}(h)$$

    Typically, $\alpha = 1$ is used.
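
A minimal code sketch of steps 1–2 (OOT selection and OWALM unigram interpolation), using plain Python dictionaries; the function names and the default value of λ are illustrative rather than taken from the paper:

```python
def oot_words(vocab_base, vocab_wiki):
    """Out-of-train set: external-corpus words absent from the baseline LM vocabulary."""
    return set(vocab_wiki) - set(vocab_base)

def owalm_unigram(word, p_base, c_wiki, oot, lam=0.9):
    """OWALM unigram probability: interpolate the baseline unigram with a
    unigram estimated from external-corpus counts restricted to OOT words.

    p_base: dict word -> baseline LM unigram probability
    c_wiki: dict word -> count in the external corpus (e.g., Wikipedia)
    oot:    set of OOT words; lam: interpolation weight (tuned on held-out data)
    """
    oot_total = sum(c_wiki[w] for w in oot)
    p_oot = c_wiki[word] / oot_total if word in oot else 0.0
    return lam * p_base.get(word, 0.0) + (1.0 - lam) * p_oot
```

Only the unigram distribution is minimally augmented here; the higher-order n-gram probabilities of the baseline LM are left untouched, which is what keeps the resulting decoding graph small.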

This protocol achieves relative WER reductions of 21.8% (Telugu) and 41.8% (Kannada), nearly matching single-pass decoding with a fully augmented LM, but at 1/8 of the memory cost (Murthy et al., 16 Mar 2024).

3. Advanced Neural Lattice Rescoring Methods

The application of neural LMs—LSTM, Transformer, or hybrid architectures—within lattice rescoring protocols leverages their superior modeling of longer-range and contextual dependencies. A canonical framework (Kumar et al., 2017, Ogawa et al., 2023) comprises:

  • Push-forward algorithm:

Maintains the $k$ best hypotheses at each node, updating LSTM/Transformer states along lattice arcs and pruning by cost. For LSTMs, each arc $a$ is rescored with cost $-\log P_{\mathrm{LSTM}}(w_a \mid \mathrm{history})$ (see the sketch after this list).

  • State-pooling and arc-beam:

Pool predecessor states (weighted by, e.g., max-probability or posterior) or, for each arc, select the best predecessor; both variants trade accuracy against computation.
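
A sketch of the push-forward recursion with per-node hypothesis beams and a stateful neural LM follows; the `lm_step` interface, arc tuple format, and beam size are assumptions made for this example.

```python
import heapq
from collections import defaultdict

def push_forward(arcs, start, end, lm_step, init_state, k=4, lm_weight=0.8):
    """Push-forward rescoring sketch: keep at most k partial hypotheses per
    node, each carrying its own neural-LM state.

    arcs: list of (src, dst, word, ac_logp); nodes numbered topologically.
    lm_step(state, word) -> (new_state, lm_logp): hypothetical stateful scorer
    wrapping an LSTM/Transformer step.
    """
    out = defaultdict(list)
    for src, dst, word, ac_logp in arcs:
        out[src].append((dst, word, ac_logp))

    hyps = defaultdict(list)                 # node -> [(score, words, lm_state)]
    hyps[start] = [(0.0, (), init_state)]
    for node in sorted(out):                 # topological order
        # Beam pruning: keep only the k highest-scoring hypotheses at this node.
        for score, words, state in heapq.nlargest(k, hyps[node], key=lambda h: h[0]):
            for dst, word, ac_logp in out[node]:
                new_state, lm_logp = lm_step(state, word)
                hyps[dst].append((score + ac_logp + lm_weight * lm_logp,
                                  words + (word,), new_state))
    return max(hyps[end], key=lambda h: h[0])  # (score, words, final LM state)
```

State-pooling variants would instead merge the surviving hypotheses at a node into a single (e.g., posterior-weighted) LM state before expansion, trading some accuracy for fewer LM evaluations.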

An iterative ensemble protocol (Ogawa et al., 2023) applies multiple neural LMs (NLMs), e.g., forward/backward LSTMs and Transformers, sequentially, convexly interpolating their scores at each pass:

$$L_{\mathrm{final}}(a) = \sum_{i=0}^{I} \gamma_i\, \log p^{i}(w_t \mid \cdot)$$

where $\gamma_i = 1/(I+1)$, ensuring equal weighting across the $I$ NLMs plus the base model. Context carry-over (using the previous lattice's final hidden state) further enhances performance in continuous speech domains.
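
A small sketch of this equal-weight interpolation, applied iteratively as each NLM pass completes; the per-arc scorer interface is hypothetical.

```python
def iterative_ensemble_scores(base_scores, nlm_scorers):
    """Maintain an equal-weight running average of per-arc log-probabilities
    over the base model and each successive NLM pass (gamma_i = 1/(I+1)).

    base_scores: dict arc_id -> base LM log-probability
    nlm_scorers: list of callables, each mapping arc_id -> log-probability
    """
    scores = dict(base_scores)
    n_models = 1
    for scorer in nlm_scorers:
        n_models += 1
        for arc_id in scores:
            # Incremental mean: new_mean = old_mean + (x - old_mean) / n
            scores[arc_id] += (scorer(arc_id) - scores[arc_id]) / n_models
    return scores
```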

Empirical findings demonstrate up to 24.4% WER reduction over 1-best baselines using an 8-model NLM ensemble with context in long-form ASR (Ogawa et al., 2023). Self-normalization and product quantization enable efficiency and compression for large neural LMs (Kumar et al., 2017).

4. Contextual, Semantic, and Metadata-Driven Extensions

Modern protocols increasingly extend lattice rescoring with contextual adaptation. Proposed directions include:

  • Context-aware RNNLM rescoring:

Lattices from adjacent utterances in conversational speech are concatenated via tagged arcs (e.g., sentence boundary, speaker ID, intent), enabling RNNLMs to model long-range dependencies. Selective connection based on tf-idf similarity ensures that only topically relevant contexts propagate, reducing character error rates by up to 13.1% (Wei et al., 2020); a selective-connection sketch follows this list.

  • Attention and pointer-based integration of external metadata:

Lattice rescoring models may attend to or copy from video metadata, such as titles or descriptions, with an attention-based or hybrid pointer-network LM (Liu et al., 2020). Such models interpolate standard generation and pointer distributions for each lattice arc, performing best when the external context is rich in rare or informative tokens; a sketch of this interpolation also follows the list.

  • Semantic and Transformer-based rescoring:

Transformer LMs trained with cross-entropy are used to rescore compacted or n-best paths, combining acoustic, n-gram, and semantic model log-probabilities with tunable weights (Sudarshan et al., 2023).
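
Illustrating the selective connection from the first bullet above, the sketch below keeps only those neighboring utterances whose first-pass hypotheses are topically similar to the current one; scikit-learn is assumed to be available, and the threshold is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_relevant_contexts(current_hyp, neighbor_hyps, threshold=0.2):
    """Return the neighboring utterance hypotheses whose tf-idf cosine
    similarity with the current hypothesis exceeds `threshold`; only these
    would be connected as context for cross-utterance RNNLM rescoring."""
    tfidf = TfidfVectorizer().fit_transform([current_hyp] + neighbor_hyps)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return [h for h, s in zip(neighbor_hyps, sims) if s >= threshold]
```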
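
For the pointer-style metadata integration in the second bullet, a sketch of interpolating a generation distribution with a copy distribution over metadata token ids (PyTorch assumed; the tensor shapes and gate value are illustrative, not the cited model's exact parameterization):

```python
import torch

def hybrid_pointer_logprobs(gen_logits, copy_attn, metadata_ids, gate=0.7):
    """Mix the LM's generation distribution with a pointer (copy) distribution
    that places the attention mass on metadata token ids.

    gen_logits:   (V,) unnormalized LM scores over the vocabulary
    copy_attn:    (M,) attention weights over metadata tokens (sums to 1)
    metadata_ids: (M,) long tensor of vocabulary ids for the metadata tokens
    gate:         weight on the generation distribution, in [0, 1]
    """
    p_gen = torch.softmax(gen_logits, dim=-1)
    p_copy = torch.zeros_like(p_gen)
    p_copy.scatter_add_(0, metadata_ids, copy_attn)   # accumulate copy mass per id
    return torch.log(gate * p_gen + (1.0 - gate) * p_copy + 1e-12)
```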

5. Computational and Memory-Efficiency Innovations

Lattice rescoring poses non-trivial computational burdens, particularly with large neural LMs or in resource-constrained environments. Several strategies have been presented:

  • Single-shot non-autoregressive lattice rescoring:

The LT-LM approach processes the entire lattice in parallel using a non-autoregressive Transformer encoder, outputting scores for all arcs simultaneously. Artificial training lattices are synthesized from large-scale text to mitigate data scarcity. LT-LM yields 300–600× speedup over streaming autoregressive rescoring, with only ~0.4% absolute WER gap on LibriSpeech (Mitrofanov et al., 2021).

  • Parallel path-based GPU rescoring:

Path-cover strategies convert expanded lattices into a minimal set of full paths, batch-score these paths with neural LMs (e.g., in PyTorch), and integrate the results back into the lattice, achieving significant runtime reductions (Li et al., 2021); a path-cover sketch appears after this list.

  • Memory-efficient lattice graph construction:

In low-resource protocols (Murthy et al., 16 Mar 2024), minimally augmented LMs inject only missing unigrams, reducing HCLG memory usage by 8× while preserving almost the entire WER gain of a fully augmented LM.

  • Efficient state-merging and context-carrying for LLM rescoring:

Lattice arcs are merged by shared context, converting tree-structured outputs into compact DAGs, and the 1-best context of preceding segments is prepended for LLM scoring. This protocol enables practical LLM rescoring in long-form and code-switched tasks (Chen et al., 2023); a context-merging sketch also follows the list.
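
A sketch of the path-cover idea from the parallel GPU rescoring bullet: build a (here non-minimal) set of full start-to-end paths that covers every arc, which can then be batch-scored by a neural LM and mapped back onto arc scores. It assumes a trimmed lattice in which every node lies on some start-to-end path; the data structures are illustrative.

```python
def _any_suffix(succ, node, end):
    """Any (arc_id, word) path from `node` to `end`, following first successors."""
    path = []
    while node != end:
        nxt, arc_id, word = succ[node][0]
        path.append((arc_id, word))
        node = nxt
    return path

def _any_prefix(pred, node, start):
    """Any (arc_id, word) path from `start` to `node`, built backwards."""
    path = []
    while node != start:
        prev, arc_id, word = pred[node][0]
        path.append((arc_id, word))
        node = prev
    return list(reversed(path))

def path_cover(succ, pred, start, end):
    """For each not-yet-covered arc, emit one full start-to-end path through it.

    succ: node -> [(next_node, arc_id, word)]; pred: node -> [(prev_node, arc_id, word)]
    Returns a list of paths, each a list of (arc_id, word) pairs; every lattice
    arc appears in at least one path.
    """
    covered, paths = set(), []
    for node, arcs in succ.items():
        for nxt, arc_id, word in arcs:
            if arc_id in covered:
                continue
            path = (_any_prefix(pred, node, start)
                    + [(arc_id, word)]
                    + _any_suffix(succ, nxt, end))
            covered.update(a for a, _ in path)
            paths.append(path)
    return paths
```

Each path's word sequence can then be scored in one NLM batch on the GPU, and per-arc scores recovered from the path-level results.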
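
And one possible realization of the context-based merging in the last bullet: expanded hypotheses that share the same trailing context (the only part a bounded-context rescorer distinguishes) are collapsed onto the best-scoring representative. The context length and data layout are assumptions for this sketch.

```python
def merge_by_context(hypotheses, context_len=4):
    """Collapse hypotheses sharing the same trailing `context_len` words onto
    the highest-scoring one, turning a tree-like expansion into a compact set.

    hypotheses: iterable of (word_tuple, score) pairs.
    """
    merged = {}
    for words, score in hypotheses:
        key = tuple(words[-context_len:])
        if key not in merged or score > merged[key][1]:
            merged[key] = (words, score)
    return list(merged.values())
```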

6. Discriminative and Hybrid Training Objectives

Integrating discriminative losses, as in MWER or MWED, into lattice rescoring allows models to optimize directly for sequence-level error metrics. RescoreBERT (Xu et al., 2022) demonstrates the following:

  • Fine-tune a BERT model via MWER/MWED loss over n-best hypotheses, optionally fusing an MLM-based distillation pretraining target.
  • Compose the final score per hypothesis as $s_i = s^a_i + \beta s^l_i$, combining the first-pass score $s^a_i$ with the rescoring-LM score $s^l_i$ (sketched below).
  • Achieve 6.6%/3.4% relative WER gains on LibriSpeech test sets compared to non-discriminative BERT, at competitive inference latencies.
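
A minimal PyTorch sketch of the score composition and an MWER-style objective as described above; the sign convention (higher score = more likely) and variable names are assumptions for this example rather than RescoreBERT's exact formulation.

```python
import torch

def rescore_mwer_loss(am_scores, lm_scores, word_errors, beta=0.5):
    """MWER-style loss over an n-best list: expected (mean-normalized) number
    of word errors under hypothesis posteriors derived from the combined
    scores s_i = s^a_i + beta * s^l_i. Gradients flow into lm_scores.

    am_scores, lm_scores: (N,) tensors of per-hypothesis scores (higher = better)
    word_errors:          (N,) tensor of edit-distance error counts vs. reference
    """
    total = am_scores + beta * lm_scores
    posteriors = torch.softmax(total, dim=0)
    relative_errors = word_errors - word_errors.float().mean()
    return torch.sum(posteriors * relative_errors)
```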

This approach demonstrates the complementarity of deep bidirectional LMs and discriminative objectives for second-pass rescoring.

7. Rescoring Beyond ASR: Applications in Lattice Codes

Although the term "lattice rescoring" most commonly refers to ASR, analogous protocols arise in finite-dimensional lattice codes for communications (Xue et al., 8 Jan 2025). Here, "rescoring" is implemented by multi-trial decoding using embedded error detection (CRC) and per-trial optimization of decoding coefficients. Retry decoding achieves up to 1.31 dB coding gain for two-user compute-forward relays at a $10^{-5}$ error rate, all at minimal extra decoding cost.
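
Generically, the multi-trial protocol can be sketched as below; `decode`, `crc_ok`, and the list of candidate coefficients are hypothetical stand-ins for the scheme in (Xue et al., 8 Jan 2025).

```python
def retry_decode(received, coefficient_candidates, decode, crc_ok):
    """Attempt decoding with successive candidate coefficient choices and stop
    at the first CRC-validated message; return None if every trial fails."""
    for coeffs in coefficient_candidates:
        message = decode(received, coeffs)
        if crc_ok(message):
            return message, coeffs
    return None, None
```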


References:

(Murthy et al., 16 Mar 2024, Kumar et al., 2017, Ogawa et al., 2023, Wei et al., 2020, Liu et al., 2020, Sudarshan et al., 2023, Xu et al., 2022, Mitrofanov et al., 2021, Li et al., 2021, Chen et al., 2023, Xue et al., 8 Jan 2025, Beck et al., 2019)
