Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction (2306.06495v1)
Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates, from noisy input audio, a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and that of a selected speaker not appearing in the video (off-screen target speech). Conventional AV-SE methods suppress all off-screen sounds, but future applications of AV-SE (e.g., hearing aids) will require listening to the speech of a specific, pre-known speaker (e.g., a family member's voice or announcements in a station) even when that speaker is outside the user's field of view. To overcome this limitation, we extract a visual clue for the on-screen target speech from the input video and a voiceprint clue for the off-screen target speech from pre-recorded speech of that speaker. The two clues from different domains are integrated into an audio-visual clue, and the proposed model directly estimates the target mixture. To improve estimation accuracy, we introduce a temporal attention mechanism for the voiceprint clue and propose a training strategy called the muting strategy. Experimental results show that our method outperforms a baseline that applies state-of-the-art AV-SE and speaker extraction methods separately, in terms of both estimation accuracy and computational efficiency.
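The abstract describes fusing a frame-level visual clue with an utterance-level voiceprint clue into a single audio-visual clue that conditions the enhancement model. The sketch below illustrates that general idea in PyTorch; all module names, dimensions, the attention formulation, and the mask-based output are assumptions for illustration, not the authors' implementation, and the "muting" comment is only an analogue of the strategy named in the abstract.

```python
# Hypothetical sketch of clue fusion with temporal attention, assuming a
# magnitude-masking enhancement backbone. Not the paper's actual architecture.
import torch
import torch.nn as nn


class CluedMaskEstimator(nn.Module):
    def __init__(self, n_freq=257, d_visual=512, d_voice=256, d_model=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq, d_model)      # noisy spectrogram frames
        self.visual_proj = nn.Linear(d_visual, d_model)   # on-screen lip embeddings
        self.voice_proj = nn.Linear(d_voice, d_model)     # off-screen voiceprint
        # Temporal attention: each audio frame attends to the projected
        # voiceprint, weighting the off-screen clue per frame.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.fusion = nn.GRU(3 * d_model, d_model, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, visual_emb, voiceprint):
        # noisy_spec: (B, T, n_freq), visual_emb: (B, T, d_visual),
        # voiceprint: (B, d_voice); visual frames assumed upsampled to T.
        a = self.audio_proj(noisy_spec)
        v = self.visual_proj(visual_emb)
        s = self.voice_proj(voiceprint).unsqueeze(1)        # (B, 1, d_model)
        # Muting-strategy analogue (assumption): a zeroed voiceprint carries no
        # off-screen request, so the model falls back to the on-screen target.
        s_attended, _ = self.attn(query=a, key=s, value=s)  # (B, T, d_model)
        clue = torch.cat([a, v, s_attended], dim=-1)        # audio-visual clue
        h, _ = self.fusion(clue)
        mask = self.mask_head(h)                            # (B, T, n_freq)
        return mask * noisy_spec                            # estimated target mixture
```

One design point the abstract implies: because a single model estimates the on-screen and off-screen targets jointly, the two enhancement passes of the baseline (separate AV-SE and speaker extraction) collapse into one forward pass, which is the source of the claimed computational advantage.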
Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima