- The paper demonstrates that shallow fusion achieves significant first-pass decoding improvements over baseline ASR models.
- It reveals that deep fusion scales poorly to large datasets, while cold fusion is most effective when combined with second-pass rescoring, where it yields the lowest oracle word error rates.
- Novel approaches, such as using a pretrained LM as a lower decoder layer, offer promising and efficient alternatives for ASR enhancement.
An Evaluation of Language Model Integration Techniques in Encoder-Decoder ASR Systems
The paper under review provides a comprehensive evaluation of methods for integrating language models (LMs) into attention-based recurrent neural encoder-decoder frameworks for automatic speech recognition (ASR). The research scrutinizes both existing techniques and novel approaches for leveraging unpaired text data to enhance encoder-decoder models, which are traditionally trained solely on paired speech and text. This matters because encoder-decoder ASR systems cannot directly exploit the vast amounts of available unpaired text, so effective external LM integration can considerably improve recognition accuracy and utility.
Experimental Framework and Methodologies
The paper investigates several prominent methods for LM integration—shallow fusion, deep fusion, and cold fusion—along with two newly proposed approaches, evaluating them on the medium-sized Switchboard dataset and on much larger Google voice search and dictation datasets. In shallow fusion, LM scores are linearly interpolated with ASR model scores during inference, whereas deep fusion and cold fusion integrate the external LM with the ASR model more tightly during training. The two new methods incorporate a pretrained LM as an additional lower layer in the decoder and train the decoder within a multitask learning framework.
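Because shallow fusion is just a log-linear interpolation of the two models' scores at each decoding step, it can be pictured in a few lines. The snippet below is a minimal sketch rather than the paper's implementation; `lm_weight` stands in for the interpolation coefficient, which would be tuned on held-out data.

```python
import numpy as np

def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine ASR and LM scores for one decoding step via shallow fusion.

    Both inputs are log-probability vectors over the output vocabulary for
    the current beam hypothesis; lm_weight is the interpolation coefficient.
    """
    return asr_log_probs + lm_weight * lm_log_probs

# Toy usage: pick the best next token for a single hypothesis.
vocab = ["a", "b", "</s>"]
asr = np.log(np.array([0.5, 0.3, 0.2]))   # encoder-decoder posterior
lm = np.log(np.array([0.2, 0.6, 0.2]))    # external LM prior
fused = shallow_fusion_scores(asr, lm, lm_weight=0.3)
print(vocab[int(np.argmax(fused))])
```

In a full decoder the same combination would be applied to every active beam hypothesis at every step, with the beam pruned on the fused scores.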
Key Findings
- Performance of Fusion Techniques: Shallow fusion consistently delivered the best first-pass decoding performance across all datasets, yielding significant improvements over the baseline encoder-decoder models. Cold fusion, although slightly less effective in the first pass, excelled when combined with second-pass rescoring, producing the lowest oracle word error rates, particularly on the Google voice search data (a short rescoring sketch follows this list).
- Scalability of Fusion Methods: The paper indicates that deep fusion does not scale as effectively with larger datasets. It showed negligible gains over the baseline model on extensive datasets like Google’s, though it performed comparably to cold fusion on Switchboard.
- Innovative Approaches: Among the novel methods introduced, using a pretrained LM as a lower decoder layer exhibited promising results, potentially rivaling deep and cold fusion. This relatively simple approach appears to merit further research and development (a decoder sketch also follows this list).
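The second-pass setup referenced above can be pictured as plain n-best rescoring: the first-pass scores from the (possibly fused) encoder-decoder model are combined with an external LM's scores and the list is re-ranked. The following is an illustrative sketch; the function names and weights are assumptions, not the paper's code.

```python
def rescore_nbest(nbest, lm_score_fn, lm_weight=0.3, length_weight=0.0):
    """Second-pass rescoring of an n-best list (illustrative sketch).

    nbest: list of (hypothesis_text, first_pass_log_score) pairs produced by
    the first-pass decoder; lm_score_fn returns the external LM's
    log-probability of a hypothesis. Weights would be tuned on dev data.
    """
    rescored = []
    for hyp, first_pass in nbest:
        total = (first_pass
                 + lm_weight * lm_score_fn(hyp)
                 + length_weight * len(hyp.split()))
        rescored.append((hyp, total))
    return max(rescored, key=lambda pair: pair[1])[0]

# Toy usage with a dummy LM that simply favors shorter hypotheses.
nbest = [("the cat sat", -4.1), ("the cats at", -4.0)]
best = rescore_nbest(nbest, lm_score_fn=lambda h: -0.5 * len(h.split()))
print(best)
```

The oracle word error rate mentioned above is the error of the best hypothesis present anywhere in the n-best list, so it bounds how much any rescoring pass can help.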
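To make the lower-decoder-layer idea concrete, here is a minimal PyTorch-style sketch of the general shape of such a decoder; the class name, layer sizes, and wiring are illustrative assumptions rather than the paper's architecture. The lower recurrent layer is initialized from a pretrained LM and sees only the previous output tokens, while the upper layer additionally consumes the attention context.

```python
import torch
import torch.nn as nn

class LMBottomDecoder(nn.Module):
    """Decoder whose lower layer plays the role of a pretrained RNN LM
    (a sketch, not the paper's exact architecture)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, context_dim,
                 pretrained_lm_rnn=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Lower layer: initialized (and optionally frozen) from a pretrained LM.
        self.lm_layer = (pretrained_lm_rnn if pretrained_lm_rnn is not None
                         else nn.LSTM(embed_dim, hidden_dim, batch_first=True))
        # Upper layer: mixes the LM-like features with the attention context.
        self.upper = nn.LSTM(hidden_dim + context_dim, hidden_dim,
                             batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, context):
        # prev_tokens: (batch, steps); context: (batch, steps, context_dim)
        emb = self.embed(prev_tokens)
        lm_out, _ = self.lm_layer(emb)                       # token-only features
        upper_out, _ = self.upper(torch.cat([lm_out, context], dim=-1))
        return self.out(upper_out)                           # vocab logits

# Toy usage with random inputs (the attention mechanism is omitted for brevity).
dec = LMBottomDecoder(vocab_size=100, embed_dim=32, hidden_dim=64, context_dim=64)
logits = dec(torch.randint(0, 100, (2, 5)), torch.randn(2, 5, 64))
print(logits.shape)  # torch.Size([2, 5, 100])
```

What distinguishes this from simply training a deeper decoder is that the lower layer's parameters come from an LM pretrained on unpaired text, giving the decoder language knowledge before any paired training begins.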
Implications and Future Directions
The implications of this research are substantial for both the theoretical understanding and practical deployment of ASR systems. The findings confirm the value of unpaired text for improving recognition performance and underscore the need for effective LM integration techniques. That a simple, robust method like shallow fusion performs best also points toward efficient deployment strategies.
Future research could further explore the utility of integrating pretrained LMs as lower decoder layers, alongside scaling experiments to assess robustness across varying dataset sizes. These investigations could provide more granular insights into optimal configurations and training methodologies, thereby advancing the capabilities of end-to-end ASR systems.