Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction (2306.06495v1)
Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates, from noisy input audio, a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and that of a selected speaker not appearing in the video (off-screen target speech). Conventional AV-SE methods suppress all off-screen sounds, but future applications of AV-SE (e.g., hearing aids) will require listening to the speech of a specific, pre-known speaker (e.g., a family member's voice or announcements in a station) even when that speaker is outside the user's field of view. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen target speech from pre-recorded speech of that speaker. The two clues from different domains are integrated into an audio-visual clue, and the proposed model directly estimates the target mixture. To improve estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline that applies state-of-the-art AV-SE and speaker extraction methods separately, in terms of both estimation accuracy and computational efficiency.
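The abstract describes fusing a frame-level visual clue with an utterance-level voiceprint clue into a single audio-visual clue that conditions the enhancement model. The sketch below illustrates that general idea in PyTorch; all module names, dimensions, the attention formulation, and the mask-based output are assumptions for illustration, not the authors' implementation, and the "muting" comment is only an analogue of the strategy named in the abstract.

```python
# Hypothetical sketch of clue fusion with temporal attention, assuming a
# magnitude-masking enhancement backbone. Not the paper's actual architecture.
import torch
import torch.nn as nn


class CluedMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, d_visual=512, d_voice=256, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, d_model)      # noisy spectrogram frames
        self.visual_proj = nn.Linear(d_visual, d_model)   # on-screen lip embeddings
        self.voice_proj = nn.Linear(d_voice, d_model)     # off-screen voiceprint
        # Temporal attention: each audio frame attends to the projected
        # voiceprint, weighting the off-screen clue per frame.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.fusion = nn.GRU(3 * d_model, d_model, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, visual_emb, voiceprint):
        # noisy_spec: (B, T, n_freq), visual_emb: (B, T, d_visual),
        # voiceprint: (B, d_voice); visual frames assumed upsampled to T.
        a = self.audio_proj(noisy_spec)
        v = self.visual_proj(visual_emb)
        s = self.voice_proj(voiceprint).unsqueeze(1)        # (B, 1, d_model)
        # Muting-strategy analogue (assumption): a zeroed voiceprint carries no
        # off-screen request, so the model falls back to the on-screen target.
        s_attended, _ = self.attn(query=a, key=s, value=s)  # (B, T, d_model)
        clue = torch.cat([a, v, s_attended], dim=-1)        # audio-visual clue
        h, _ = self.fusion(clue)
        mask = self.mask_head(h)                            # (B, T, n_freq)
        return mask * noisy_spec                            # estimated target mixture
```

One design point the abstract implies: because a single model estimates the on-screen and off-screen targets jointly, the two enhancement passes of the baseline (separate AV-SE and speaker extraction) collapse into one forward pass, which is the source of the claimed computational advantage.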
Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima