
Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications (2403.06570v2)

Published 11 Mar 2024 in cs.CL

Abstract: Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, since the model is also applied to VAD segments at test time, and show that this results in a relative reduction in Speaker Error Rate (SER) of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than from annotated speaker segments results in a relative SER reduction of up to 20%.
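
The abstract reports its gains as relative SER reductions. As a minimal sketch of how such a figure is usually computed, assuming the common definition (baseline minus improved, divided by baseline), the helper below may be useful. The function name `relative_reduction` and the numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction, in percent, from baseline to improved.

    Assumes the usual definition: 100 * (baseline - improved) / baseline.
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline - improved) / baseline


# Hypothetical error rates in percent, for illustration only
# (NOT values reported in the paper).
baseline_ser = 10.0
improved_ser = 7.5

reduction = relative_reduction(baseline_ser, improved_ser)
print(f"{reduction:.1f}% relative SER reduction")
```

Note that a relative reduction differs from an absolute one: in this made-up case the absolute drop is 2.5 points, while the relative reduction is a quarter of the baseline.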

Authors (4)
  1. Can Cui
  2. Imran Ahamad Sheikh
  3. Mostafa Sadeghi
  4. Emmanuel Vincent