ConVoiFilter: A case study of doing cocktail party speech recognition
Abstract: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the target speaker's voice from background noise, with an ASR module. With this approach, the model reduces the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR accuracy. With a joint fine-tuning strategy, the model reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
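To put the reported WER figures in perspective, the following is a minimal sketch rather than the authors' evaluation code. It assumes the standard definition of WER (word-level edit distance divided by the number of reference words) and plain whitespace tokenization with no text normalization. The function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, introduced only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was removed."""
    return (before - after) / before


# Made-up sentences: one deleted word plus one substituted word out of five.
example_wer = word_error_rate("turn on the kitchen light",
                              "turn the kitchen lights")

# WER values quoted in the abstract, in percent.
gain_from_enhancement = relative_reduction(80.0, 26.4)
gain_from_joint_tuning = relative_reduction(26.4, 14.5)
```

Under these assumptions, adding the enhancement module corresponds to a relative WER reduction of about 67%, and joint fine-tuning removes roughly a further 45% of the remaining errors.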