Acoustic Model Fusion for End-to-end Speech Recognition (2310.07062v1)
Abstract: Recent advances in deep learning and automatic speech recognition (ASR) have enabled end-to-end (E2E) ASR systems and raised recognition accuracy to a new level. An E2E system implicitly models all conventional ASR components, such as the acoustic model (AM) and the language model (LM), in a single network trained on audio-text pairs. Despite this simpler architecture, fusing a separate LM, trained exclusively on text corpora, into the E2E system has proven beneficial. However, LM fusion has certain drawbacks, such as its inability to address the domain mismatch issue inherent to the internal AM. Drawing inspiration from LM fusion, we propose integrating an external AM into the E2E system to better address the domain mismatch. With this novel approach, we achieve a significant reduction in word error rate, with a drop of up to 14.3% across varied test sets. We also find that this AM fusion approach is particularly beneficial in enhancing named entity recognition.
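The abstract does not spell out the fusion formula, but the LM-fusion idea it generalizes is commonly realized as log-linear score interpolation at each decoding step. Below is a minimal sketch of that pattern extended to an external AM; the function name, toy vocabulary size, and interpolation weights `lam_am` and `lam_lm` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

VOCAB = 32  # toy vocabulary size; real systems use thousands of subword units


def fused_step_scores(e2e_logits, ext_am_logits, lm_logits,
                      lam_am=0.3, lam_lm=0.2):
    """Log-linear interpolation of per-step token scores, shallow-fusion style:

        score = log P_E2E + lam_am * log P_extAM + lam_lm * log P_LM

    All three inputs are unnormalized logits over the same vocabulary;
    the weights are hypothetical tuning knobs, not values from the paper.
    """
    return (F.log_softmax(e2e_logits, dim=-1)
            + lam_am * F.log_softmax(ext_am_logits, dim=-1)
            + lam_lm * F.log_softmax(lm_logits, dim=-1))


if __name__ == "__main__":
    torch.manual_seed(0)
    # Random stand-ins for one decoding step of the three models.
    e2e = torch.randn(1, VOCAB)
    ext_am = torch.randn(1, VOCAB)
    lm = torch.randn(1, VOCAB)

    scores = fused_step_scores(e2e, ext_am, lm)
    print("next token:", scores.argmax(dim=-1).item())
```

In practice such weights would be tuned on a development set, and the combined scores would feed a beam search rather than the greedy argmax shown here.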
Authors: Zhihong Lei, Mingbin Xu, Shiyi Han, Leo Liu, Zhen Huang, Tim Ng, Yuanyuan Zhang, Ernest Pusateri, Mirko Hannemann, Yaqiao Deng, Man-Hung Siu