
Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach (2403.06856v1)

Published 11 Mar 2024 in eess.AS

Abstract: We present a deep-learning approach for the task of Concurrent Speaker Detection (CSD) using a modified transformer model. Our model is designed to handle multi-microphone data but can also work in the single-microphone case. The method classifies audio segments into one of three classes: 1) no speech activity (noise only), 2) only a single speaker is active, and 3) more than one speaker is active. We incorporate a Cost-Sensitive (CS) loss and confidence calibration into the training procedure. The approach is evaluated on three real-world databases: AMI, AliMeeting, and CHiME 5, demonstrating an improvement over existing approaches.
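
To make the three-class setup and the cost-sensitive idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the cost matrix values, the weight `lam`, and the use of temperature scaling as the calibration step are all assumptions chosen for illustration, since the abstract does not specify them. The loss shown is one common form of cost-sensitive regularization, cross-entropy plus the expected misclassification cost under the predicted distribution.

```python
import math

# The three classes named in the abstract.
CLASSES = ("noise_only", "single_speaker", "multiple_speakers")

# Hypothetical cost matrix, COST[true][predicted]. The values are illustrative
# assumptions: confusing noise with overlapped speech costs more than
# confusing adjacent classes.
COST = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]


def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature (T > 1 softens the distribution).

    Temperature scaling is used here as an assumed example of confidence
    calibration; the abstract does not say which calibration method is used.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cost_sensitive_loss(logits, true_idx, lam=1.0, temperature=1.0):
    """Cross-entropy plus lam times the expected cost under the prediction."""
    probs = softmax(logits, temperature)
    cross_entropy = -math.log(probs[true_idx])
    expected_cost = sum(COST[true_idx][k] * p for k, p in enumerate(probs))
    return cross_entropy + lam * expected_cost


if __name__ == "__main__":
    true_idx = CLASSES.index("noise_only")
    # Two equally confident wrong predictions for a noise-only segment.
    near_miss = cost_sensitive_loss([0.0, 5.0, 0.0], true_idx)
    far_miss = cost_sensitive_loss([0.0, 0.0, 5.0], true_idx)
    print(f"predicted single speaker:    {near_miss:.3f}")
    print(f"predicted multiple speakers: {far_miss:.3f}")
```

With `lam=0` the function reduces to ordinary cross-entropy. With a positive `lam`, the two wrong predictions in the usage example have the same cross-entropy, but the one that lands further from the true class in the cost matrix receives the larger loss.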
