An analysis of incorporating an external language model into a sequence-to-sequence model (1712.01996v1)

Published 6 Dec 2017 in eess.AS, cs.AI, cs.CL, and cs.SD

Abstract: Attention-based sequence-to-sequence models for automatic speech recognition jointly train an acoustic model, language model, and alignment mechanism. Thus, the language model component is only trained on transcribed audio-text pairs. This leads to the use of shallow fusion with an external language model at inference time. Shallow fusion refers to log-linear interpolation with a separately trained language model at each step of the beam search. In this work, we investigate the behavior of shallow fusion across a range of conditions: different types of language models, different decoding units, and different tasks. On Google Voice Search, we demonstrate that the use of shallow fusion with a neural LM with wordpieces yields a 9.1% relative word error rate reduction (WERR) over our competitive attention-based sequence-to-sequence model, obviating the need for second-pass rescoring.

Integration of External Language Models into Sequence-to-Sequence Models for ASR

This paper addresses the integration of an external language model (LM) into attention-based sequence-to-sequence models for automatic speech recognition (ASR), focusing on the Listen, Attend, and Spell (LAS) framework. LAS models inherently couple acoustic modeling, language modeling, and alignment within a single neural network. During training, however, the LM component sees only the transcribed audio-text pairs, which constrains its ability to generalize, especially to rare words or phrases. To mitigate this limitation, the paper investigates shallow fusion at inference time: the log-linear interpolation of a separately trained LM with the seq2seq model's scores at each step of the beam search.
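To make the mechanism concrete, here is a minimal sketch of shallow fusion inside a beam search, assuming the seq2seq decoder and the external LM are each exposed as a callable that maps a token prefix to next-token log-probabilities. The function names, the toy vocabulary, and the interpolation weight `lam` are illustrative, not from the paper's implementation.

```python
# Minimal sketch of shallow fusion in beam search (illustrative names).
# At each step the fused score is:
#   log p_seq2seq(y | x, prefix) + lam * log p_lm(y | prefix)

def fused_scores(seq2seq_logprobs, lm_logprobs, lam):
    """Log-linearly interpolate per-token log-probs from the two models."""
    return {
        tok: lp + lam * lm_logprobs.get(tok, float("-inf"))
        for tok, lp in seq2seq_logprobs.items()
    }

def beam_search(seq2seq_step, lm_step, beam_size=8, max_len=20, lam=0.3, eos="</s>"):
    beams = [((), 0.0)]  # (token prefix, cumulative fused log score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            scores = fused_scores(seq2seq_step(prefix), lm_step(prefix), lam)
            for tok, s in scores.items():
                hyp = (prefix + (tok,), score + s)
                (finished if tok == eos else candidates).append(hyp)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return max(finished + beams, key=lambda h: h[1])

# Toy distributions for demonstration only; real models condition on the prefix.
vocab = {"a": -0.5, "b": -1.5, "</s>": -2.0}
best_hypothesis, best_score = beam_search(lambda p: vocab, lambda p: vocab,
                                          beam_size=2, max_len=3)
```

In practice the interpolation weight is tuned on held-out data rather than fixed in advance.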

The research explores the influence of different LM types, decoding units, and tasks on the effectiveness of shallow fusion. It compares LM architectures, including RNN LMs and traditional n-gram LMs, in conjunction with grapheme-based and wordpiece-based decoding units. Experiments are conducted on both the well-known Wall Street Journal (WSJ) corpus and Google's large-scale Voice Search task to assess the scalability and generalizability of the approach.

Key findings of the research include:

  1. Performance on WSJ Corpus: The experiments indicate that shallow fusion with RNN LMs consistently outperformed n-gram models in reducing word error rate (WER). On WSJ, shallow fusion with an RNN LM provided a significant improvement over the baseline, reflecting the RNN's ability to model longer context than an n-gram LM.
  2. Wordpieces vs. Graphemes: The research extends shallow fusion to wordpiece-based decoding units, showing that although wordpiece models are at a disadvantage on smaller datasets owing to their larger output vocabulary, they benefit from shorter target sequences with fewer dependencies to model, and they perform well on WSJ when coupled with strong external LMs (an illustrative contrast of the two decoding units follows this list).
  3. Scalability to Large-Scale Tasks: When applied to the Google Voice Search task, a corpus far larger than WSJ, both grapheme and wordpiece models were competitive even without external support. Shallow fusion with an RNN LM nonetheless provided a 9.1% relative word error rate reduction over the baseline, eliminating the need for a second-pass rescoring stage while covering the extensive vocabulary the task requires.
  4. Efficiency Considerations: The paper highlights the compactness and efficiency of RNN LMs relative to traditional n-gram models, particularly with respect to memory and computational resources. RNN LMs deliver a substantial WER reduction at a manageable model size, making them suitable for first-pass decoding even in infrastructure-constrained settings.
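
As referenced in finding 2, the following sketch contrasts the two decoding units. The wordpiece segmentation shown is made up for illustration; the paper's actual wordpiece inventory is learned from data.

```python
# Hypothetical contrast between grapheme and wordpiece decoding units.
utterance = "recognize speech"

graphemes = list(utterance)
# ['r', 'e', 'c', 'o', 'g', 'n', 'i', 'z', 'e', ' ', 's', 'p', 'e', 'e', 'c', 'h']

wordpieces = ["_recog", "nize", "_speech"]  # illustrative segmentation only
# Fewer, longer units mean shorter target sequences for the decoder and the
# external LM, at the cost of a much larger output vocabulary.
```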

These insights not only reveal the practical efficacy of integrating external LMs into sequence-to-sequence models but also point to future directions for enhancing neural ASR systems. The implications are significant for real-time applications where latency, memory, and accuracy are critical factors. Future work could pursue tighter integration of external language models, leveraging advanced neural architectures and domain-specific adaptation to extend the flexibility and applicability of ASR systems across more diverse tasks and languages.

Authors (6)
  1. Anjuli Kannan (19 papers)
  2. Yonghui Wu (115 papers)
  3. Patrick Nguyen (15 papers)
  4. Tara N. Sainath (79 papers)
  5. Zhifeng Chen (65 papers)
  6. Rohit Prabhavalkar (59 papers)
Citations (243)