Cascaded encoders for fine-tuning ASR models on overlapped speech (2306.16398v1)
Abstract: Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping speech datasets. On the other hand, the trend in ASR has been to train foundation models using massive datasets collected from a wide variety of task domains. Given the scale of these models and their ability to generalize well across a variety of domains, it makes sense to consider scenarios where a foundation model is augmented with multi-talker capability. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascade configuration provides improved WER on overlapping speech utterances with respect to a baseline multi-talker model without sacrificing performance achievable by the foundation model on non-overlapping utterances.
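To make the cascaded-encoder idea concrete, below is a minimal sketch, assuming a PyTorch-style interface: a frozen, well-trained foundation encoder feeds a small trainable multi-talker "mask" encoder whose output would go to an RNN-T prediction/joint network. Class names such as `CascadedMTEncoder`, `FoundationEncoder` stand-ins, and all dimensions are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a cascaded-encoder MT-ASR setup (hypothetical names/dims,
# not the paper's code). A frozen foundation encoder is followed by a small
# trainable multi-talker mask encoder; only the latter is fine-tuned.
import torch
import torch.nn as nn


class CascadedMTEncoder(nn.Module):
    """Frozen foundation encoder cascaded with a trainable multi-talker mask encoder."""

    def __init__(self, foundation_encoder: nn.Module, d_model: int = 512, n_layers: int = 4):
        super().__init__()
        self.foundation = foundation_encoder
        # Freeze the well-trained foundation encoder so single-talker performance is preserved.
        for p in self.foundation.parameters():
            p.requires_grad = False
        # Lightweight mask encoder stacked on top of the foundation representations.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.mask_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim) acoustic frames
        with torch.no_grad():
            base = self.foundation(features)   # foundation-model representations
        return self.mask_encoder(base)         # refined representations for overlapped speech


# Usage sketch: the cascaded encoder output would feed an RNN-T prediction/joint
# network (omitted here) and be trained on simulated overlapping-speech data.
if __name__ == "__main__":
    foundation = nn.Linear(80, 512)            # stand-in for a Conformer foundation encoder
    model = CascadedMTEncoder(foundation)
    x = torch.randn(2, 100, 80)                # (batch, frames, log-mel bins)
    print(model(x).shape)                      # torch.Size([2, 100, 512])
```

The key design choice this sketch illustrates is that the foundation encoder's parameters are untouched during fine-tuning, so non-overlapping utterances can still be recognized at the foundation model's original accuracy while the cascaded mask encoder adds multi-talker capability.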
Authors: Richard Rose, Oscar Chang, Olivier Siohan