One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition (2310.01688v1)
Abstract: This paper presents a novel framework for joint speaker diarization (SD) and automatic speech recognition (ASR), named SLIDAR (sliding-window diarization-augmented recognition). SLIDAR can process arbitrary-length inputs and can handle any number of speakers, effectively solving "who spoke what, when" concurrently. SLIDAR leverages a sliding-window approach and consists of an end-to-end diarization-augmented speech transcription (E2E DAST) model which provides, locally for each window, transcripts, diarization, and speaker embeddings. The E2E DAST model is based on an encoder-decoder architecture and leverages recent techniques such as serialized output training and "Whisper-style" prompting. The local outputs are then combined into the final SD+ASR result by clustering the speaker embeddings to obtain global speaker identities. Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
Authors: Samuele Cornell, Jee-weon Jung, Shinji Watanabe, Stefano Squartini