Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax (2401.11645v1)
Abstract: Despite several advances in multilingual modeling, recognizing multiple languages with a single neural model remains challenging when the input language is unknown, and most multilingual models assume the input language is given. In this work, we propose a novel bilingual end-to-end (E2E) modeling approach in which a single neural model recognizes both languages and supports switching between them, without any language input from the user. The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism. Because the language-specific posteriors are combined, the model produces a single posterior probability over all output symbols, enabling a single beam-search decoding pass and allowing dynamic switching between the languages. The proposed approach outperforms a conventional bilingual baseline, with relative word error rate reductions of 13.3%, 8.23%, and 1.3% on Hindi, English, and code-mixed test sets, respectively.
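The combination step described in the abstract can be sketched in PyTorch: two language-specific joint networks each produce a softmax posterior from the shared encoder and prediction outputs, and a learned attention scorer weights the two posteriors into a single distribution over the union of output symbols. The layer sizes, the exact joint-network structure, and the attention formulation below are illustrative assumptions, not the paper's published configuration.

```python
import torch
import torch.nn as nn

class AttentionOverSoftmaxJoint(nn.Module):
    """Illustrative sketch: language-specific joint networks whose softmax
    posteriors are mixed by attention weights into one posterior, so a
    single beam search can decode (and switch) across both languages."""

    def __init__(self, enc_dim=4, pred_dim=4, joint_dim=8, vocab=16, n_langs=2):
        super().__init__()
        # one joint network per language (hypothetical sizes)
        self.joints = nn.ModuleList(
            nn.Sequential(
                nn.Linear(enc_dim + pred_dim, joint_dim),
                nn.Tanh(),
                nn.Linear(joint_dim, vocab),
            )
            for _ in range(n_langs)
        )
        # attention scorer: one mixing weight per language-specific softmax
        self.attn = nn.Linear(enc_dim + pred_dim, n_langs)

    def forward(self, enc_out, pred_out):
        # enc_out: (B, T, enc_dim); pred_out: (B, U, pred_dim)
        B, T, _ = enc_out.shape
        U = pred_out.shape[1]
        # standard RNN-T style pairing of every (t, u) position
        x = torch.cat(
            [enc_out.unsqueeze(2).expand(B, T, U, -1),
             pred_out.unsqueeze(1).expand(B, T, U, -1)],
            dim=-1,
        )
        # per-language posteriors: (B, T, U, vocab, n_langs)
        posts = torch.stack([j(x).softmax(dim=-1) for j in self.joints], dim=-1)
        # attention weights over languages: (B, T, U, n_langs)
        w = self.attn(x).softmax(dim=-1)
        # convex combination -> a single valid posterior over all symbols
        return (posts * w.unsqueeze(-2)).sum(dim=-1)
```

Because the output is a convex combination of softmax distributions, it still sums to one over the symbol set, which is what allows a single beam search to decode both languages and switch between them dynamically.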