Target Speech Extraction with Pre-trained AV-HuBERT and Mask-And-Recover Strategy (2403.16078v1)

Published 24 Mar 2024 in cs.SD and eess.AS

Abstract: Audio-visual target speech extraction (AV-TSE) is an enabling technology for robotics and many other audio-visual applications. A key challenge in AV-TSE is how to effectively exploit audio-visual synchronization information. AV-HuBERT is a pre-trained model that has proven useful for lip-reading but has not yet been adopted for AV-TSE. In this paper, we explore how to integrate a pre-trained AV-HuBERT into an AV-TSE system, with the expectation that its visual representations improve extraction performance. To exploit inter- and intra-modality correlations, we further propose a novel Mask-And-Recover (MAR) strategy for self-supervised learning. Experimental results on the VoxCeleb2 dataset show that the proposed model outperforms the baselines on both subjective and objective metrics, suggesting that the pre-trained AV-HuBERT model provides more informative visual cues for target speech extraction. A comparative study further confirms the effectiveness of the proposed Mask-And-Recover strategy.
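
The abstract describes the Mask-And-Recover (MAR) idea only at a high level. As a rough illustration of how such a strategy can sit alongside the extraction objective, the following PyTorch sketch masks contiguous spans of fused audio-visual features and adds a recovery loss on the masked frames. The module names (TinyMARModel, recover_head), feature shapes, masking parameters, and loss weighting are assumptions made for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a Mask-And-Recover style training step (assumed setup).
import torch
import torch.nn as nn

def mask_frames(feats: torch.Tensor, mask_ratio: float = 0.3, span: int = 10):
    """Zero out random contiguous spans of frames. feats: (B, T, D)."""
    B, T, _ = feats.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=feats.device)
    n_spans = max(1, int(T * mask_ratio / span))
    for b in range(B):
        starts = torch.randint(0, max(1, T - span), (n_spans,))
        for s in starts.tolist():
            mask[b, s:s + span] = True
    masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask

class TinyMARModel(nn.Module):
    """Toy stand-in: fuses mixture-audio and visual-cue features, predicts a
    target-speech representation, and recovers the masked fused features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.extract_head = nn.Linear(dim, dim)  # proxy for the separation decoder
        self.recover_head = nn.Linear(dim, dim)  # predicts the unmasked fused features

    def forward(self, audio_feats, visual_feats):
        fused = self.fuse(torch.cat([audio_feats, visual_feats], dim=-1))
        masked, mask = mask_frames(fused)
        hidden, _ = self.backbone(masked)
        return self.extract_head(hidden), self.recover_head(hidden), fused, mask

# One illustrative training step on random tensors.
model = TinyMARModel()
audio = torch.randn(4, 100, 256)   # e.g. mixture encoder output
visual = torch.randn(4, 100, 256)  # e.g. visual features from a lip-reading front-end
target = torch.randn(4, 100, 256)  # proxy for the clean target representation

est, recovered, fused, mask = model(audio, visual)
extract_loss = nn.functional.mse_loss(est, target)
# The recovery loss is computed only on the masked frames; the target is detached
# so the model cannot trivially collapse the reconstruction objective.
recover_loss = nn.functional.mse_loss(recovered[mask], fused.detach()[mask])
loss = extract_loss + 0.5 * recover_loss  # loss weighting is an assumption
loss.backward()
```

In this sketch the recovery objective forces the backbone to infer missing frames from the surrounding audio and visual context, which is the intuition behind exploiting inter- and intra-modality correlations; the paper's actual architecture, masking scheme, and losses may differ.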

Authors (5)
  1. Wenxuan Wu (16 papers)
  2. Xueyuan Chen (20 papers)
  3. Xixin Wu (85 papers)
  4. Haizhou Li (285 papers)
  5. Helen Meng (204 papers)