
The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge (2409.02041v2)

Published 3 Sep 2024 in eess.AS and cs.SD

Abstract: This technical report outlines our submission system for the CHiME-8 NOTSOFAR-1 Challenge. The primary difficulty of this challenge is its dataset, recorded across various conference rooms, which captures real-world complexities such as high overlap rates, background noise, a variable number of speakers, and natural conversation styles. To address these issues, we optimized the system in several aspects. For front-end speech signal processing, we introduced a data-driven joint training method for diarization and separation (JDS) to enhance audio quality. Additionally, we integrated traditional guided source separation (GSS) for the multi-channel track to provide complementary information to the JDS. For back-end speech recognition, we enhanced Whisper with WavLM, ConvNeXt, and Transformer innovations, applying multi-task training and Noise KLD augmentation to significantly improve ASR robustness and accuracy. Our system attained a Time-Constrained minimum Permutation Word Error Rate (tcpWER) of 14.265% and 22.989% on the CHiME-8 NOTSOFAR-1 Dev-set-2 multi-channel and single-channel tracks, respectively.
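The abstract does not spell out what "Noise KLD augmentation" computes. A common form of such consistency regularization, sketched below under that assumption (the function name `noise_kld_loss` and its arguments are hypothetical, not from the paper), is a KL divergence between the model's output distributions on clean and noise-augmented versions of the same input, encouraging noise-invariant predictions:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def noise_kld_loss(clean_logits, noisy_logits, eps=1e-8):
    """Hypothetical sketch of a noise-consistency KLD term:
    KL(P_clean || P_noisy), averaged over frames. Penalizes the
    model when added noise shifts its output distribution away
    from the one produced on the clean signal."""
    p = softmax(clean_logits)
    q = softmax(noisy_logits)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return kl.mean()
```

In training, this term would typically be added to the ASR loss with a small weight, so the recognizer learns representations that are stable under the noise conditions the challenge data exhibits.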

