
Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention (2404.18501v2)

Published 29 Apr 2024 in eess.AS and cs.SD

Abstract: Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually locate the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of the target speech while ignoring variations in the noise characteristics, which may lead to extracting signals from the wrong sound source in challenging acoustic conditions. To this end, we propose a novel reverse selective auditory attention mechanism that suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and exploiting the undesired noisy signal through this mechanism, we design an AV-TSE framework named the Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and evaluating with nine metrics. The results show that the proposed SEANet achieves state-of-the-art performance on all five datasets. We will release the code, models, and data logs.
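The abstract describes a subtraction-and-extraction scheme: a reverse attention mechanism first estimates the undesired interference and non-speech components, which are then removed from the mixture before target extraction. The sketch below illustrates that general idea in PyTorch. It is not the authors' SEANet implementation; every module name, shape, and layer choice here (ReverseAttentionSuppressor, the sigmoid gating, the GRU extractor) is an assumption made purely for illustration.

# Minimal, illustrative sketch of a subtraction-and-extraction idea inspired by
# the abstract (NOT the authors' SEANet code; all names and shapes are hypothetical).
import torch
import torch.nn as nn

class ReverseAttentionSuppressor(nn.Module):
    """Estimates the undesired (noise/interference) part of a mixture embedding
    by reversing a visually conditioned attention mask, then subtracts it
    before running a target extractor."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)      # visual cue -> query
        self.key = nn.Linear(dim, dim)        # audio mixture -> key
        self.noise_proj = nn.Linear(dim, dim) # projects the "reversed" part to a noise estimate
        self.extractor = nn.GRU(dim, dim, batch_first=True)

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor):
        # audio_emb:  (B, T, D) frame-level mixture embedding
        # visual_emb: (B, T, D) lip-motion embedding aligned to the audio frames
        q = self.query(visual_emb)
        k = self.key(audio_emb)
        # Frame-wise relevance of the mixture to the target speaker, in (0, 1).
        attn = torch.sigmoid((q * k).sum(-1, keepdim=True) / k.size(-1) ** 0.5)
        # "Reverse" attention: emphasise everything that is NOT the target.
        noise_est = self.noise_proj(audio_emb * (1.0 - attn))
        # Subtract the estimated noise, then extract the target from the residue.
        cleaned = audio_emb - noise_est
        target_emb, _ = self.extractor(cleaned)
        return target_emb, noise_est

# Usage sketch with random tensors (batch=2, 100 frames, dim=256).
if __name__ == "__main__":
    model = ReverseAttentionSuppressor(dim=256)
    a = torch.randn(2, 100, 256)
    v = torch.randn(2, 100, 256)
    target, noise = model(a, v)
    print(target.shape, noise.shape)  # torch.Size([2, 100, 256]) twice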

Authors (6)
  1. Ruijie Tao (25 papers)
  2. Xinyuan Qian (30 papers)
  3. Yidi Jiang (18 papers)
  4. Junjie Li (98 papers)
  5. Jiadong Wang (19 papers)
  6. Haizhou Li (286 papers)
Citations (1)