Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications (2403.06570v2)
Abstract: Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study that aims to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, since the model is also applied to VAD segments at test time, and show that this yields a relative reduction in Speaker Error Rate (SER) of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.
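The abstract reports its gains as relative SER reductions. As a reading aid, the arithmetic behind such a figure can be sketched as below. The function name and the two SER values are hypothetical, chosen only to illustrate the computation; they are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of an error rate, in percent of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical SER values in percent (assumptions for illustration only,
# not figures reported in the paper).
baseline_ser = 10.0
improved_ser = 7.2

reduction = relative_reduction(baseline_ser, improved_ser)
print(f"Relative SER reduction: {reduction:.1f}%")
```

A relative reduction is measured against the baseline error rate, so the same absolute drop in SER corresponds to a larger relative reduction when the baseline is lower.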
Authors: Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent