Integration of External Language Models into Sequence-to-Sequence Models for ASR
This paper addresses the integration of an external language model (LM) into attention-based sequence-to-sequence models for automatic speech recognition (ASR), focusing specifically on the Listen, Attend and Spell (LAS) framework. LAS models fold acoustic modeling, language modeling, and alignment into a single neural network. During training, however, the implicit language model learned by the decoder sees only the transcribed audio-text pairs, which constrains its ability to generalize, especially to rare words and phrases. To mitigate this limitation, the paper investigates shallow fusion at inference time: the log-linear interpolation of a separately trained LM with the decoder's scores at each step of the beam search.
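As a rough illustration of how this interpolation enters the search, the sketch below rescores each candidate token inside a single beam-search expansion step. The function names, the dictionary-based interface, and the weight value `lm_weight=0.3` are illustrative assumptions, not the paper's implementation.

```python
def shallow_fusion_scores(decoder_log_probs, lm_log_probs, lm_weight=0.3):
    """Log-linearly interpolate the seq2seq decoder's per-token log-probabilities
    with those of a separately trained external LM (shallow fusion).

    Both arguments map candidate output tokens to log-probabilities; lm_weight
    is a tuned hyperparameter controlling the external LM's influence.
    """
    fused = {}
    for token, dec_lp in decoder_log_probs.items():
        lm_lp = lm_log_probs.get(token, float("-inf"))
        fused[token] = dec_lp + lm_weight * lm_lp
    return fused


def extend_beam(hypotheses, decoder_log_probs, lm_log_probs, beam_size=8, lm_weight=0.3):
    """One beam-search expansion step using fused scores.

    hypotheses: list of (token_sequence, cumulative_score) pairs.
    decoder_log_probs / lm_log_probs: callables mapping a prefix (list of tokens)
    to a dict of per-token log-probabilities from the decoder and the external LM.
    """
    candidates = []
    for prefix, score in hypotheses:
        fused = shallow_fusion_scores(decoder_log_probs(prefix), lm_log_probs(prefix), lm_weight)
        for token, step_score in fused.items():
            candidates.append((prefix + [token], score + step_score))
    # Keep only the top-scoring partial hypotheses for the next step.
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:beam_size]
```

Because the interpolation happens inside the first decoding pass, no separate rescoring pass over an n-best list is needed.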
The research explores how the type of LM, the choice of decoding unit, and the task itself influence the effectiveness of shallow fusion. It compares LM architectures, including RNN LMs and traditional n-gram LMs, in conjunction with grapheme-based and wordpiece-based decoding units. Experiments are conducted on both the well-known Wall Street Journal (WSJ) corpus and Google's large-scale Voice Search task to assess the scalability and generalizability of the approach.
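The practical difference between the two LM families is how much history they condition on: an n-gram LM looks at a fixed window of preceding units, while an RNN LM summarizes the entire prefix in a recurrent state. The toy classes below are a minimal sketch of that distinction, not the models trained in the paper; the count-based bigram and the untrained single-layer RNN cell are illustrative stand-ins.

```python
import math
from collections import defaultdict

import numpy as np


class BigramLM:
    """Count-based n-gram LM with n=2: the probability of the next unit
    depends only on the single most recent unit in the history."""

    def __init__(self, training_tokens, vocab):
        self.vocab = vocab
        self.counts = defaultdict(lambda: defaultdict(int))
        for prev, cur in zip(training_tokens, training_tokens[1:]):
            self.counts[prev][cur] += 1

    def log_prob(self, token, history):
        prev = history[-1]  # fixed, truncated context (assumes history starts with <s>)
        total = sum(self.counts[prev].values())
        return math.log((self.counts[prev][token] + 1) / (total + len(self.vocab)))


class TinyRNNLM:
    """Single-layer RNN LM with untrained (random) weights, shown only to
    illustrate that the recurrent state summarizes the *entire* history."""

    def __init__(self, vocab, hidden_size=16, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        v, h = len(vocab), hidden_size
        self.emb = rng.normal(size=(v, h))
        self.w_hh = rng.normal(size=(h, h)) * 0.1
        self.w_out = rng.normal(size=(h, v)) * 0.1

    def log_prob(self, token, history):
        state = np.zeros(self.emb.shape[1])
        for tok in history:  # unbounded context: every token updates the state
            state = np.tanh(self.emb[self.vocab[tok]] + self.w_hh @ state)
        logits = self.w_out.T @ state
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        return float(log_probs[self.vocab[token]])
```

Either object exposes the same `log_prob(token, history)` interface, so it could be plugged into the fusion step sketched above; only the amount of history that actually affects the score differs.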
Key findings of the research include:
- Performance on the WSJ Corpus: The experiments indicate that shallow fusion with RNN LMs consistently outperformed n-gram models in reducing word error rate (WER). On WSJ, shallow fusion with an RNN LM yielded a significant improvement over the baseline, reflecting the RNN's ability to condition on longer context than an n-gram LM can.
- Wordpieces vs. Graphemes: The research extends shallow fusion to wordpiece-based decoding units, showing that although wordpiece models, being more complex units to learn, are at a disadvantage on smaller datasets, they benefit from emitting shorter output sequences with fewer inter-unit dependencies to model, as demonstrated by strong performance on WSJ when coupled with strong external LMs (see the tokenization sketch after this list).
- Scalability to Large-Scale Tasks: On the Google Voice Search task, a corpus far larger than WSJ, both grapheme and wordpiece models achieved competitive performance even without an external LM. Shallow fusion with an RNN LM nevertheless delivered a 9.1% relative WER reduction over the baseline, removing the need for a secondary rescoring pass while handling the task's broad vocabulary.
- Efficiency Considerations: The paper highlights the compactness and efficiency of RNN LMs relative to traditional n-gram models, particularly with respect to memory and computation. RNN LMs offer a substantial WER reduction while maintaining a manageable model size, making them suitable for first-pass decoding even in infrastructure-constrained settings.
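To make the grapheme/wordpiece distinction concrete, the sketch below segments one phrase with both unit types. The greedy longest-match segmenter and the tiny wordpiece inventory are illustrative assumptions, not the inventory or algorithm used in the paper.

```python
def to_graphemes(phrase):
    """Grapheme decoding units: one output symbol per character, with an
    explicit space symbol, so the decoder emits many short units."""
    return ["<space>" if ch == " " else ch for ch in phrase]


def to_wordpieces(phrase, vocab):
    """Greedy longest-match-first segmentation into wordpieces: longer,
    data-derived units, so the decoder takes far fewer steps."""
    pieces = []
    for word in phrase.split():
        word = "_" + word  # "_" marks a word boundary
        start = 0
        while start < len(word):
            end = len(word)
            while end > start + 1 and word[start:end] not in vocab:
                end -= 1
            pieces.append(word[start:end])  # falls back to a single character
            start = end
    return pieces


# Purely illustrative wordpiece inventory.
vocab = {"_voice", "_sear", "ch"}

print(to_graphemes("voice search"))
# ['v', 'o', 'i', 'c', 'e', '<space>', 's', 'e', 'a', 'r', 'c', 'h']  -> 12 steps
print(to_wordpieces("voice search", vocab))
# ['_voice', '_sear', 'ch']                                           -> 3 steps
```

Fewer, longer units mean the decoder and the external LM model dependencies over a shorter sequence, which is consistent with the finding that wordpieces pay off as training data grows.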
These insights not only demonstrate the practical efficacy of integrating external LMs into sequence-to-sequence models but also point to future directions for neural ASR systems. The implications matter for real-time applications where latency, memory, and accuracy are all critical. Future work could explore tighter methods of LM integration and further domain-specific adaptation to extend the flexibility and applicability of ASR systems across more diverse tasks and languages.