Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach (2403.06856v1)
Published 11 Mar 2024 in eess.AS
Abstract: We present a deep-learning approach for the task of Concurrent Speaker Detection (CSD) using a modified transformer model. Our model is designed to handle multi-microphone data but can also work in the single-microphone case. The method classifies audio segments into one of three classes: 1) no speech activity (noise only), 2) only a single speaker is active, and 3) more than one speaker is active. We incorporate a Cost-Sensitive (CS) loss and confidence calibration into the training procedure. The approach is evaluated on three real-world databases: AMI, AliMeeting, and CHiME-5, demonstrating an improvement over existing approaches.
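The abstract does not spell out the loss, so the following is only a minimal sketch of how a cost-sensitive objective over the three CSD classes might look. It assumes a cross-entropy term plus an expected-misclassification-cost penalty, with optional label smoothing standing in for training-time confidence calibration. The cost matrix `COST`, the weight `lam`, the `smoothing` parameter, and all function names are hypothetical choices for illustration, not the authors' formulation.

```python
import math

# The three CSD classes named in the abstract, indexed 0, 1, 2.
CLASSES = ("noise only", "single speaker", "multiple speakers")

# Hypothetical cost matrix, COST[true][predicted]. Assumption: confusing
# "noise only" with "multiple speakers" is penalized more than confusing
# adjacent classes. The paper may use different values.
COST = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cost_sensitive_loss(logits, true_class, lam=1.0, smoothing=0.0):
    """Cross-entropy (optionally label-smoothed) plus lam * expected cost.

    Assumed form, not taken from the paper:
        loss = CE(smoothed_target, p) + lam * sum_k COST[true][k] * p[k]
    """
    probs = softmax(logits)
    k = len(probs)
    # Label smoothing: move a `smoothing` share of the target mass to a
    # uniform distribution, which discourages overconfident outputs.
    targets = [
        smoothing / k + (1.0 - smoothing) * (1.0 if i == true_class else 0.0)
        for i in range(k)
    ]
    cross_entropy = -sum(t * math.log(p) for t, p in zip(targets, probs))
    expected_cost = sum(c * p for c, p in zip(COST[true_class], probs))
    return cross_entropy + lam * expected_cost


def classify(logits):
    """Return the predicted class name and its softmax confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


if __name__ == "__main__":
    segment_logits = [0.2, 2.5, 1.0]  # hypothetical model output for one segment
    print(classify(segment_logits))
    print(cost_sensitive_loss(segment_logits, true_class=1, lam=0.5, smoothing=0.1))
```

Under these assumptions, a segment whose true class is "noise only" incurs a larger loss when the model leans toward "multiple speakers" than when it leans toward "single speaker", even though the plain cross-entropy term is identical in both cases.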