Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy (2403.16078v1)
Abstract: Audio-visual target speech extraction (AV-TSE) is one of the enabling technologies for robotics and many other audio-visual applications. A key challenge in AV-TSE is how to effectively exploit audio-visual synchronization information during extraction. AV-HuBERT is a pre-trained lip-reading model that has not yet been adopted for AV-TSE. In this paper, we explore how to integrate a pre-trained AV-HuBERT into an AV-TSE system, expecting its visual representations to provide stronger guidance for extraction. To further exploit inter- and intra-modality correlations, we also propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. Experimental results on the VoxCeleb2 dataset show that the proposed model outperforms the baselines on both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. Furthermore, a comparative study confirms that the proposed Mask-And-Recover strategy contributes a significant additional improvement.
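The abstract describes the Mask-And-Recover idea only at a high level: frames of an intermediate audio-visual representation are hidden during training, and the model is asked to recover them so that it learns to exploit intra- and inter-modality context. Below is a minimal PyTorch sketch of one plausible realization of such an objective; the mask ratio, learned mask token, recovery head, and MSE reconstruction loss are illustrative assumptions, not the paper's actual modules or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAndRecover(nn.Module):
    """Illustrative Mask-And-Recover (MAR) objective (assumed design, not the
    paper's exact configuration): randomly mask a fraction of frames in a fused
    audio-visual representation and train a small recovery head to reconstruct
    the masked frames from the surrounding context."""

    def __init__(self, feat_dim: int = 256, mask_ratio: float = 0.3):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Learned embedding that replaces masked frames.
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        # Lightweight head that predicts the original frames back.
        self.recover_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, time, feat_dim) audio-visual features from the extractor.
        B, T, D = fused.shape
        mask = torch.rand(B, T, device=fused.device) < self.mask_ratio
        corrupted = torch.where(
            mask.unsqueeze(-1), self.mask_token.expand(B, T, D), fused
        )
        recovered = self.recover_head(corrupted)
        if not mask.any():  # nothing masked in this batch
            return fused.new_zeros(())
        # Reconstruction loss on masked frames only; in training this would be
        # added to the usual signal-level extraction loss (e.g. SI-SDR).
        return F.mse_loss(recovered[mask], fused.detach()[mask])
```

In this sketch the MAR term acts as an auxiliary loss alongside the extraction loss, encouraging the network to infer missing audio-visual frames from the unmasked context rather than relying on any single modality.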
Authors: Wenxuan Wu, Xueyuan Chen, Xixin Wu, Haizhou Li, Helen Meng