Convoifilter: A case study of doing cocktail party speech recognition

Published 22 Aug 2023 in cs.SD, cs.CL, and eess.AS | (2308.11380v3)

Abstract: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the speaker's voice from background noise, with an ASR module. This approach decreases the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR accuracy. A joint fine-tuning strategy reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research: hf.co/nguyenvulebinh/voice-filter.
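
The abstract reports its results as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the paper's evaluation code; the function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, and real evaluations usually apply text normalization first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error rate dropped, relative to its starting value."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical transcript pair: one substitution and one deletion
    # against six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")

    # WER figures quoted in the abstract, in percent.
    print(f"enhancement + separate tuning: {relative_reduction(80.0, 26.4):.1%}")
    print(f"joint vs. separate tuning:     {relative_reduction(26.4, 14.5):.1%}")
```

Applied to the figures in the abstract, the move from separate tuning (26.4%) to joint tuning (14.5%) is a relative WER reduction of roughly 45%, on top of the 67% relative reduction from 80% to 26.4%.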
