Minimally Augmented Lattice Rescoring
- The paper introduces a novel ASR approach that minimally augments decoding lattices to ensure all out-of-train unigrams are represented, boosting OOV recall.
- It employs a count-merging scheme combining baseline and unigram language models with strict minimality to balance enhanced hypothesis inclusion and resource efficiency.
- Empirical results show significant WER improvements and OOV-recovery increases from 6–7% to 35–56% alongside up to 300× speedup in neural rescoring.
Minimally augmented lattice rescoring denotes a class of methods in automatic speech recognition (ASR) and related sequence prediction tasks that introduce the smallest possible modifications to a decoding lattice or the associated language model (LM), with the explicit goal of maximizing out-of-vocabulary (OOV) or rare-event recall in the final hypotheses while rigorously controlling computational and memory requirements. These techniques address the longstanding challenge wherein the initial decoding with a constrained or under-resourced LM eliminates or fails to score many correct hypotheses, rendering later application of more powerful LMs ineffective. Minimally augmented rescoring frameworks prioritize lattice inclusivity and neural LM efficiency through judicious alteration of baseline LMs, topology expansion under strict posterior or unigram-driven control, and parallel or single-shot neural rescoring architectures.
1. Baseline Lattice Rescoring and Its Limitations
In a hybrid WFST-based ASR pipeline, the decoding graph $HCLG$ combines HMM topology ($H$), context expansion ($C$), lexicon ($L$), and grammar/language-model FST ($G$). The baseline LM is often a small $n$-gram, built from transcript-only corpora under Witten–Bell or similar smoothing, resulting in severe vocabulary undercoverage for low-resource languages even when large external text corpora are available. Standard n-best or lattice rescoring with a larger LM only revisits hypotheses present in the original lattice; words that never appear as arcs are irretrievable, so true OOVs in the reference cannot be recovered by rescoring. The absence of arcs representing OOT (out-of-train) words is thus the principal bottleneck for OOV recall in these systems (Murthy et al., 2024).
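The bottleneck can be made concrete with a toy sketch (all data here is hypothetical): rescoring can only reweight hypotheses that already exist in the lattice, so a reference word absent from every arc is unrecoverable no matter how strong the rescoring LM is.

```python
# Toy illustration: rescoring re-ranks existing hypotheses; it never
# introduces new words into the lattice.

def rescore(hypotheses, lm_logprob, lm_scale=0.8):
    """Re-rank existing hypotheses with a stronger LM; no new words appear."""
    def score(hyp):
        words, acoustic = hyp
        return acoustic + lm_scale * sum(lm_logprob(w) for w in words)
    return max(hypotheses, key=score)

# Lattice hypotheses as (word sequence, acoustic log-score) pairs.
lattice = [(("the", "cat", "sat"), -10.0),
           (("the", "bat", "sat"), -9.5)]

# Even a rescoring LM that strongly prefers the OOV word "catamaran"
# cannot surface it: the word sits on no lattice arc.
strong_lm = lambda w: 0.0 if w == "catamaran" else -2.0
best = rescore(lattice, strong_lm)
assert "catamaran" not in best[0]
```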
2. Minimally Augmented LMs for Inclusive Lattices
To overcome the vocabulary bottleneck without incurring the memory explosion of large-vocabulary LM decoding, Murthy & Sitaram propose augmenting the baseline $n$-gram LM with unigram counts drawn from a large text corpus (e.g., Wikipedia), specifically for those OOT words missing from the transcript LM. The union of the two vocabularies forms the vocabulary of the resulting augmented LM. No thresholding or ranking is applied; the augmentation is strictly the set difference between the external unigram vocabulary and the transcript vocabulary, preserving strict minimality.
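A minimal sketch of this vocabulary construction, with hypothetical toy corpora: only unigrams missing from the transcript LM are added, so the added set is exactly the set difference.

```python
# Hypothetical vocabularies; in practice these come from the transcript
# LM and a large external corpus such as Wikipedia.
transcript_vocab = {"the", "cat", "sat", "on", "mat"}
wiki_vocab = {"the", "cat", "catamaran", "regatta", "mat"}

# Out-of-train (OOT) words: present in the external corpus, absent from
# the transcript LM. No thresholding or ranking is applied.
oot_words = wiki_vocab - transcript_vocab
assert oot_words == {"catamaran", "regatta"}

# The minimally augmented LM's vocabulary is the union.
augmented_vocab = transcript_vocab | oot_words
assert augmented_vocab == transcript_vocab | wiki_vocab
```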
Probability merging employs a count-merging scheme:

$$p(w \mid h) = \frac{\sum_i \beta_i \, c_i(h) \, p_i(w \mid h)}{\sum_i \beta_i \, c_i(h)}$$

where $i$ indexes the baseline and unigram LMs, $c_i(h)$ is the count of history $h$ in source $i$, $p_i(w \mid h)$ reduces to the unigram probability $p_2(w)$ for the augmenting LM, and the $\beta_i$ are hand-tuned interpolation weights.
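A sketch of count merging for a single history, under the formulation above (the toy counts, distributions, and default weights are illustrative assumptions, not the paper's tuned settings):

```python
def count_merge(p1, p2, c1, c2, beta1=1.0, beta2=1.0):
    """Merge two conditional distributions p_i(w|h) for one history h,
    weighting each source by beta_i * c_i(h)."""
    w1, w2 = beta1 * c1, beta2 * c2
    vocab = set(p1) | set(p2)
    denom = w1 + w2
    return {w: (w1 * p1.get(w, 0.0) + w2 * p2.get(w, 0.0)) / denom
            for w in vocab}

# Toy example: baseline bigram distribution vs. external unigram model.
baseline = {"cat": 0.7, "mat": 0.3}
unigram = {"cat": 0.2, "mat": 0.2, "catamaran": 0.6}
merged = count_merge(baseline, unigram, c1=90, c2=10)

# The merged distribution stays normalized and now covers "catamaran".
assert abs(sum(merged.values()) - 1.0) < 1e-9
assert merged["catamaran"] > 0.0
```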
This approach minimally increases the lattice vocabulary such that, after decoding, lattices are guaranteed to contain at least one arc for every OOT unigram, thereby enabling downstream LM rescoring methods to reach those words (Murthy et al., 2024).
3. Two-Pass Lattice Decoding and Rescoring Workflow
The full workflow proceeds as:
- Initial Decode: Construct using the minimally augmented LM, ensuring all OOT unigrams have corresponding arcs in the lattice.
- Rescoring Pass: Apply lattice rescoring (e.g., Kaldi's `lattice-lmrescore`) using a full, large-vocabulary LM (e.g., transcript + Wikipedia) constructed through count-merging or other interpolation.
Each arc $a$ in the lattice is rescored as:

$$S(a) = \alpha \cdot \mathrm{AM}(a) + \lambda \cdot \mathrm{LM}(a) + \mathrm{IP}$$

where $\alpha$ is the acoustic weight, $\lambda$ the LM scale, and $\mathrm{IP}$ a word-insertion penalty.
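A minimal sketch of this per-arc combination; the default values of `alpha`, `lam`, and `ip` are illustrative assumptions, not the paper's tuned settings.

```python
def arc_score(am_logprob, lm_logprob, alpha=1.0, lam=0.8, ip=-0.5):
    """Combine acoustic and LM log-probabilities with an insertion penalty."""
    return alpha * am_logprob + lam * lm_logprob + ip

# Rescoring swaps in the large-LM score while reusing the lattice's
# acoustic score; the arc itself (and hence its word) is unchanged.
first_pass = arc_score(am_logprob=-4.0, lm_logprob=-3.0)
rescored = arc_score(am_logprob=-4.0, lm_logprob=-1.5)
assert rescored > first_pass
```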
Empirically, this two-pass strategy nearly matches the WER improvements of full large-LM first-pass decoding while shrinking the decoding graph's memory footprint to roughly one-eighth for languages such as Telugu and Kannada. For instance, Telugu requires $4$ GB (OOT-augmented) vs $32$ GB (full Wiki), and Kannada's full-Wiki graph requires $18$ GB (Murthy et al., 2024).
| Method | Telugu WER (%) | Kannada WER (%) |
|---|---|---|
| Baseline decode (no rescoring) | 25.51 | 51.87 |
| Decode+Rescore (baseline → full Wiki) | 25.51 → 25.14 | 51.87 → 50.08 |
| Full-Wiki decode | 21.01 | 28.12 |
| OWALM-decode + full-Wiki rescore | 26.76 → 20.92 | 52.01 → 30.27 |
The OOV-recovery rate rises from 6–7% with naïve rescoring to 35–56% with minimal augmentation.
4. Minimal Lattice Expansion in Neural Rescoring
Beyond $n$-gram augmentation, minimal expansion techniques generalize to neural LM rescoring. Posterior-based lattice expansion extends the initial lattice only along high-posterior (likely) arcs, i.e., arcs whose posterior exceeds a tuned threshold, duplicating states as necessary to maintain unambiguous LM contexts and avoid exponential blowup.
After expansion, a minimal best-path cover is extracted: a smallest set of complete word hypotheses such that each lattice arc lies on at least one path, and each path is optimal for some arc. These hypotheses are then scored in parallel by a neural LM (e.g., LSTM or Transformer). The final LM cost per arc is a linear interpolation of the neural-LM cost and the original $n$-gram cost. This approach yields sharply smaller lattices (e.g., 6.4 vs 31.5 arcs/frame) and enables fast parallel computation on GPU/TPU with negligible loss in recognition accuracy (Li et al., 2021).
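One way to realize such a cover on a small DAG lattice (a sketch, not the paper's exact algorithm): for every arc, take the globally best complete path constrained through that arc (best prefix to its source, the arc itself, best suffix from its destination), then deduplicate. Each arc then lies on at least one selected path, and each selected path is optimal for some arc. The data structures here are illustrative assumptions.

```python
def best_path_cover(arcs, start, end):
    """arcs: list of (src, dst, word, cost) over a DAG whose node ids
    are topologically numbered; lower cost is better."""
    nodes = {start, end} | {a[0] for a in arcs} | {a[1] for a in arcs}

    # Forward DP: cheapest cost from start to each node, with the arc
    # used to reach it as a backpointer.
    fwd = {n: (float("inf"), None) for n in nodes}
    fwd[start] = (0.0, None)
    for s, d, w, c in sorted(arcs):
        if fwd[s][0] + c < fwd[d][0]:
            fwd[d] = (fwd[s][0] + c, (s, d, w, c))

    # Backward DP: cheapest cost from each node to the final state.
    bwd = {n: (float("inf"), None) for n in nodes}
    bwd[end] = (0.0, None)
    for s, d, w, c in sorted(arcs, reverse=True):
        if bwd[d][0] + c < bwd[s][0]:
            bwd[s] = (bwd[d][0] + c, (s, d, w, c))

    def path_through(arc):
        s, d, w, _ = arc
        prefix, node = [], s
        while fwd[node][1] is not None:
            a = fwd[node][1]
            prefix.append(a[2])
            node = a[0]
        suffix, node = [], d
        while bwd[node][1] is not None:
            a = bwd[node][1]
            suffix.append(a[2])
            node = a[1]
        return tuple(reversed(prefix)) + (w,) + tuple(suffix)

    # Best complete path constrained through each arc; deduplication
    # yields the cover.
    return {path_through(a) for a in arcs}

# Tiny lattice: nodes 0..3, topologically numbered.
arcs = [(0, 1, "the", 1.0), (0, 2, "a", 2.0), (1, 2, "bat", 0.5),
        (1, 3, "cat", 1.0), (2, 3, "sat", 0.2)]
cover = best_path_cover(arcs, start=0, end=3)
# Three hypotheses cover all five arcs.
assert len(cover) == 3
```

The cover is what gets batched through the neural LM: a handful of complete sentences rather than every arc in isolation, which is what makes parallel GPU/TPU scoring cheap.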
5. Non-Autoregressive, Single-Shot Lattice Rescoring
The LT-LM framework represents a distinct, non-autoregressive approach to minimally augmented lattice rescoring. Here, the entire lattice is encoded as a bag of arcs, each represented by the sum of word and positional embeddings. Multiple Transformer encoder layers perform self-attention over the arc set, and a scoring head predicts, for each arc, its probability of lying on the oracle (lowest WER) path. The system's training combines real and artificially generated lattices with a binary cross-entropy oracle-path loss.
At inference, a single forward pass produces a new score for every arc; final arc weights are obtained through a geometric or log-linear combination of these scores with the acoustic scores and the original LM scores, governed by tuned hyperparameters (Mitrofanov et al., 2021). The method's computational complexity is on the order of $O(L \cdot A^2 \cdot d)$ per lattice (with $A$ arcs, $L$ layers, hidden size $d$), and it demonstrates a speedup of more than two orders of magnitude over pruned-RNNLM lattice rescoring while approaching the WER of much heavier Transformer-LM n-best methodologies. For example, on LibriSpeech dev_clean:
- LT-LM single-shot WER: 2.51%
- Transformer 500-best: 2.50%
- Decoding speed: 3.8 seconds (LT-LM) vs 15m41s (Transformer 500-best) (Mitrofanov et al., 2021).
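The per-arc combination at LT-LM inference can be sketched as follows; the weights `lam_ac`, `lam_lm`, and `lam_lt` are hypothetical stand-ins for the paper's tuned hyperparameters.

```python
import math

def combine_arc_weight(ac_logp, lm_logp, ltlm_oracle_prob,
                       lam_ac=1.0, lam_lm=0.5, lam_lt=1.0):
    """Log-linear mix of acoustic, original-LM, and LT-LM arc scores.
    The LT-LM head emits an oracle-path probability per arc, folded in
    here as a log-probability."""
    return (lam_ac * ac_logp
            + lam_lm * lm_logp
            + lam_lt * math.log(max(ltlm_oracle_prob, 1e-12)))

# An arc the LT-LM believes is on the oracle path outscores one it
# does not, all else being equal.
on_oracle = combine_arc_weight(-4.0, -3.0, 0.9)
off_oracle = combine_arc_weight(-4.0, -3.0, 0.1)
assert on_oracle > off_oracle
```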
6. Memory, Computational Efficiency, and Tradeoffs
Minimally augmented techniques maintain tight control over memory allocation and runtime by avoiding unnecessary vocabulary expansion, limiting lattice growth to what is empirically and probabilistically justified, and leveraging batched neural-LM efficiencies. For instance, the OWALM (Out-of-Train Words Augmented LM) method achieves nearly all of the WER gain of full large-LM decoding while requiring only 4 GB of RAM for graph construction (vs. 32 GB for baseline+Wiki on Telugu).
The primary tradeoffs involve inclusivity versus lattice or model size and WER. For posterior-based expansion, a lower posterior threshold yields more accurate but larger lattices. Non-autoregressive approaches trade a very small loss in WER for orders-of-magnitude speedups.
7. Limitations and Future Directions
Minimally augmented lattice rescoring does not directly address named-entity variants absent from external unigrams. In agglutinative languages, morpheme-based or subword LM augmentation (e.g., via Morfessor or BPE tokenization) remains necessary for full surface-form recall. The applicability to non-WFST (end-to-end) ASR architectures is an open research direction.
Prospective extensions include dynamic OOV word registration at runtime, integration with TTS-synthesized audio to augment rare-word representation, and continual learning of minimal augmentations. More generally, the principles of minimizing lattice augmentation while maximizing recall and LM scoring efficiency provide a blueprint for robust rescoring across diverse, resource-constrained ASR and sequence modeling tasks (Murthy et al., 2024, Li et al., 2021, Mitrofanov et al., 2021).