Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction (2306.06495v1)

Published 10 Jun 2023 in eess.AS and cs.SD

Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates, from noisy input audio, a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and of a selected speaker not appearing in the video (off-screen target speech). Conventional AV-SE methods suppress all off-screen sounds, but future applications of AV-SE (e.g., hearing aids) will require users to hear the speech of a specific, known speaker (e.g., a family member's voice or station announcements) even when that speaker is outside their field of view. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen target speech from a pre-recorded utterance of that speaker. The two clues from different domains are integrated into a single audio-visual clue, and the proposed model directly estimates the target mixture. To improve estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline that applies state-of-the-art AV-SE and speaker extraction methods separately, in terms of both estimation accuracy and computational efficiency.
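The abstract sketches the overall design, fusing a visual clue with a voiceprint clue via temporal attention and training with a "muting strategy", but gives no implementation details. The PyTorch sketch below is only a rough illustration of that idea under assumed interfaces: the AudioVisualClueFusion class, the mute_clues helper, and all module names and feature dimensions are hypothetical and are not the authors' implementation.

```python
import torch
import torch.nn as nn


class AudioVisualClueFusion(nn.Module):
    """Hypothetical sketch: fuse an on-screen visual clue with an
    off-screen voiceprint clue via temporal attention, then condition
    a separator on the combined audio-visual clue."""

    def __init__(self, dim=256):
        super().__init__()
        # Visual clue: per-frame lip embeddings projected to a common dimension.
        self.visual_proj = nn.Linear(512, dim)
        # Voiceprint clue: a single utterance-level speaker embedding.
        self.voiceprint_proj = nn.Linear(192, dim)
        # Temporal attention: lets each mixture frame weight the voiceprint clue.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Separator stub conditioned on the fused clue (placeholder for the
        # actual enhancement network, which the abstract does not specify).
        self.separator = nn.GRU(dim * 2, dim, batch_first=True)
        self.mask_head = nn.Linear(dim, 257)  # e.g., an STFT magnitude mask

    def forward(self, mix_feat, visual_feat, voiceprint):
        # mix_feat:    (B, T, dim)  encoded noisy mixture
        # visual_feat: (B, T, 512)  lip-region features of the on-screen speaker
        # voiceprint:  (B, 192)     enrollment embedding of the off-screen speaker
        v = self.visual_proj(visual_feat)                  # (B, T, dim)
        s = self.voiceprint_proj(voiceprint).unsqueeze(1)  # (B, 1, dim)
        # Queries are mixture frames; key/value is the voiceprint, so its
        # contribution can vary over time.
        s_t, _ = self.attn(mix_feat, s, s)                 # (B, T, dim)
        clue = torch.cat([v, s_t], dim=-1)                 # audio-visual clue
        h, _ = self.separator(clue)
        return torch.sigmoid(self.mask_head(h))            # mask for the target mixture


def mute_clues(visual_feat, voiceprint, p=0.5):
    """One possible reading of the 'muting strategy' (an assumption, not the
    paper's definition): randomly zero one clue during training so the model
    learns to output only the speech matching the surviving clue."""
    if torch.rand(()) < p:
        visual_feat = torch.zeros_like(visual_feat)
    else:
        voiceprint = torch.zeros_like(voiceprint)
    return visual_feat, voiceprint
```

In this reading, the attention output lets the separator weight the enrollment voiceprint differently at each frame, while randomly muting a clue during training pushes the model to emit only the speech that matches the remaining clue; the paper's actual architecture and muting strategy may differ.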

Authors (3)
  1. Tomoya Yoshinaga (1 paper)
  2. Keitaro Tanaka (8 papers)
  3. Shigeo Morishima (33 papers)
