Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation (2404.11275v1)
Abstract: In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure one another. This complexity makes it difficult to structure and recognize the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to further improve recognition robustness. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV significantly improves recognition accuracy on each track of the mixed audio.
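The sketch below illustrates the two-stage pipeline the abstract describes: an MTASS front-end that separates the mixture into speech and singing-voice tracks (background music receives no output track and is thus discarded), a recognizer applied to each separated track, and an online-distillation loss between the recognizer's outputs on separated audio and on clean reference tracks. This is a minimal, hypothetical illustration, not the authors' implementation; all module names, layer sizes, the GRU/CTC-only recognizer, and the KL-based distillation objective are assumptions for readability.

```python
# Hypothetical sketch of the JRSV-style pipeline (assumed components, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MTASSFrontend(nn.Module):
    """Placeholder separator: predicts one mask per target track (speech, vocals)
    over the mixture features; background music is removed implicitly because
    no mask is produced for it."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.mask_heads = nn.ModuleList(
            [nn.Linear(hidden, feat_dim) for _ in range(2)]  # speech, singing voice
        )

    def forward(self, mix_feats):                      # (B, T, F)
        h, _ = self.encoder(mix_feats)
        masks = [torch.sigmoid(head(h)) for head in self.mask_heads]
        return [m * mix_feats for m in masks]          # two separated tracks


class Recognizer(nn.Module):
    """Toy recognizer standing in for the CTC/attention hybrid; only a
    CTC-style output branch is sketched here."""
    def __init__(self, feat_dim=80, hidden=256, vocab=4000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=3, batch_first=True)
        self.ctc_out = nn.Linear(hidden, vocab)

    def forward(self, feats):                          # (B, T, F)
        h, _ = self.encoder(feats)
        return self.ctc_out(h)                         # (B, T, vocab) logits


def online_distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between outputs on separated audio (student) and on the
    clean reference track (teacher); one common way to realize online distillation."""
    p_teacher = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T ** 2)


if __name__ == "__main__":
    separator, asr = MTASSFrontend(), Recognizer()
    mix = torch.randn(2, 120, 80)                         # mixed speech + vocals + music
    clean = [torch.randn(2, 120, 80) for _ in range(2)]   # clean speech / vocal references
    sep_tracks = separator(mix)
    loss = sum(
        online_distillation_loss(asr(sep), asr(ref))
        for sep, ref in zip(sep_tracks, clean)
    )
    print("distillation loss:", loss.item())
```

In practice this distillation term would be combined with the separation and CTC/attention recognition losses; the weighting and training schedule shown here are placeholders.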