
One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition (2310.01688v1)

Published 2 Oct 2023 in eess.AS, cs.CL, and cs.SD

Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined to obtain the final SD+ASR result by clustering the speaker embeddings into global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
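The final step described above, merging per-window local speaker labels into global identities by clustering speaker embeddings, can be illustrated with a minimal sketch. This is not the paper's actual clustering algorithm; the function name, the cosine-similarity threshold, and the greedy running-mean assignment are all illustrative assumptions standing in for whatever clustering method SLIDAR uses:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def cluster_global_speakers(window_embeddings, threshold=0.8):
    """Assign per-window local speaker embeddings to global identities.

    window_embeddings: list of dicts, one per sliding window, mapping a
    local speaker id to its embedding vector. Returns, for each window,
    a dict mapping local speaker id -> global speaker id.

    Hypothetical greedy scheme: an embedding joins the most similar
    existing global speaker if similarity clears `threshold`, otherwise
    it founds a new global speaker. Centroids are updated as running means.
    """
    centroids, counts, mapping = [], [], []
    for window in window_embeddings:
        local_map = {}
        for local_id, emb in window.items():
            sims = [cosine(emb, c) for c in centroids]
            if sims and max(sims) >= threshold:
                g = sims.index(max(sims))
                counts[g] += 1
                # incremental running-mean update of the centroid
                centroids[g] = [c + (e - c) / counts[g]
                                for c, e in zip(centroids[g], emb)]
            else:
                g = len(centroids)
                centroids.append(list(emb))
                counts.append(1)
            local_map[local_id] = g
        mapping.append(local_map)
    return mapping
```

For example, if local speaker 0 of window 0 and local speaker 0 of window 1 have near-identical embeddings, both map to the same global id even though their local ids collide with different speakers in other windows. The paper itself combines this clustering with the per-window transcripts and diarization to produce the joint SD+ASR output.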

Authors (4)
  1. Samuele Cornell (41 papers)
  2. Jee-weon Jung (69 papers)
  3. Shinji Watanabe (416 papers)
  4. Stefano Squartini (17 papers)
Citations (13)
