Effective internal language model training and fusion for factorized transducer model (2404.01716v1)
Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is used mainly to estimate an ILM score that is subtracted during inference to facilitate better integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly adopt a standalone internal language model for non-blank token prediction. However, even with factorized transducer models, only limited improvement has been observed over shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models that effectively combines the blank, acoustic, and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on the LibriSpeech datasets. Furthermore, compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general test sets and an 8.9% WER reduction on rare words. The proposed model achieves superior performance without relying on external language models, making it highly efficient for production use cases. To further improve performance, we also propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method, which significantly improves ILM integration.
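The decoding strategy described in the abstract fuses three score streams at each beam-search step: a blank score, an acoustic (label) score, and the standalone ILM score. The sketch below illustrates one plausible instantiation, assuming a HAT-style factorization with a sigmoid blank gate; the function name `combine_scores` and the interpolation weights `ilm_weight` and `acoustic_weight` are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of score fusion for a factorized transducer, assuming a
# HAT-style blank gate, an acoustic label distribution from the joiner, and
# a standalone ILM over non-blank tokens. Weights are illustrative only.
import torch
import torch.nn.functional as F

def combine_scores(
    blank_logit: torch.Tensor,         # (beam,) raw logit of the blank gate
    acoustic_log_probs: torch.Tensor,  # (beam, vocab) log P_AM(y | x, y_prev)
    ilm_log_probs: torch.Tensor,       # (beam, vocab) log P_ILM(y | y_prev)
    ilm_weight: float = 0.3,
    acoustic_weight: float = 1.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (blank_score, label_scores) used to extend beam hypotheses."""
    # Blank is modeled by a Bernoulli gate, as in HAT-style factorizations.
    log_p_blank = F.logsigmoid(blank_logit)        # log P(blank)
    log_p_not_blank = F.logsigmoid(-blank_logit)   # log (1 - P(blank))

    # Non-blank tokens: scale the label mass by (1 - P(blank)) and fuse the
    # acoustic and internal-LM scores in log space.
    label_scores = (
        log_p_not_blank.unsqueeze(-1)
        + acoustic_weight * acoustic_log_probs
        + ilm_weight * ilm_log_probs
    )
    return log_p_blank, label_scores
```

In a beam search, each hypothesis would then be extended with `log_p_blank` as the blank-emission score and the top entries of `label_scores` as non-blank extensions; the exact weighting and normalization follow the paper's proposed strategy and are not reproduced here.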
Authors: Jinxi Guo, Niko Moritz, Yingyi Ma, Frank Seide, Chunyang Wu, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer