The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge (2409.02041v2)
Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge lies in its dataset, recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noise, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several respects. For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. We also integrated traditional guided source separation (GSS) on the multi-channel track to provide information complementary to JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation to substantially improve ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% on the multi-channel track and 22.989% on the single-channel track of the CHiME-8 NOTSOFAR-1 Dev-set-2.
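The abstract only names the back-end components; as a rough illustration of what "enhancing Whisper with WavLM" can look like, below is a minimal PyTorch sketch that fuses self-supervised features into an encoder stream. The class name `FeatureFusion`, the feature dimensions, and the gated-sum fusion rule are illustrative assumptions, not the architecture described in the report.

```python
# Minimal sketch: fusing self-supervised (WavLM-style) features into a
# Whisper-style encoder stream. All dimensions and the gated-sum fusion
# rule are illustrative assumptions, not the paper's actual design.
import torch
import torch.nn as nn


class FeatureFusion(nn.Module):
    """Projects SSL features to the encoder width and gates them in."""

    def __init__(self, ssl_dim: int = 1024, enc_dim: int = 1280):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, enc_dim)  # map WavLM dim -> encoder dim
        self.gate = nn.Sequential(nn.Linear(2 * enc_dim, enc_dim), nn.Sigmoid())

    def forward(self, enc: torch.Tensor, ssl: torch.Tensor) -> torch.Tensor:
        # enc: (B, T, enc_dim) Whisper-style encoder states
        # ssl: (B, T, ssl_dim) time-aligned WavLM-style features
        ssl = self.proj(ssl)
        g = self.gate(torch.cat([enc, ssl], dim=-1))  # per-frame mixing weight
        return g * enc + (1.0 - g) * ssl


if __name__ == "__main__":
    fusion = FeatureFusion()
    enc = torch.randn(2, 1500, 1280)  # dummy Whisper encoder output
    ssl = torch.randn(2, 1500, 1024)  # dummy WavLM features
    print(fusion(enc, ssl).shape)     # torch.Size([2, 1500, 1280])
```

The sketch assumes the two feature streams are already aligned in time; in practice any frame-rate mismatch between the self-supervised front end and the encoder would need resampling or interpolation before fusion.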
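The abstract also mentions "Noise KLD augmentation" without defining it; a common reading of such a name is a KL-divergence consistency regularizer between the model's posteriors on clean and noise-augmented views of the same utterance. The following sketch shows that general pattern under this assumption only; `model`, the loss weight `alpha`, and the symmetric-KL form are all hypothetical, not the report's exact formulation.

```python
# Hedged sketch of a "Noise KLD"-style regularizer: cross-entropy on both a
# clean and a noise-augmented view, plus a symmetric KL-divergence term that
# pulls the two token posteriors together. The exact formulation in the
# report may differ; this only illustrates the general pattern.
import torch
import torch.nn.functional as F


def noise_kld_loss(model, clean, noisy, targets, alpha: float = 0.5):
    # model: hypothetical callable mapping features (B, T, F) -> logits (B, T, V)
    logits_c = model(clean)  # logits on clean features
    logits_n = model(noisy)  # logits on noise-augmented features
    # Standard CE on both views; targets: (B, T) token ids.
    ce = (F.cross_entropy(logits_c.transpose(1, 2), targets)
          + F.cross_entropy(logits_n.transpose(1, 2), targets))
    # Symmetric KL between the two posteriors (both in log space).
    p_c = F.log_softmax(logits_c, dim=-1)
    p_n = F.log_softmax(logits_n, dim=-1)
    kld = 0.5 * (F.kl_div(p_n, p_c, log_target=True, reduction="batchmean")
                 + F.kl_div(p_c, p_n, log_target=True, reduction="batchmean"))
    return ce + alpha * kld
```

The noisy view could, for instance, be produced by mixing background noise into the waveform at a random SNR before feature extraction, so the regularizer encourages predictions that are stable under the acoustic conditions the challenge data exhibits.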