Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings (2210.17456v3)

Published 31 Oct 2022 in eess.AS and cs.SD

Abstract: AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be obtained by utilizing multi-modal self-supervised embeddings. Nevertheless, it is unclear whether such representations can be generalized to solve real-world multi-modal regression tasks, such as audio-visual speech enhancement (AVSE) and audio-visual speech separation (AVSS). In this study, we leveraged the pre-trained AV-HuBERT model followed by an SE module for AVSE and AVSS. Comparative experimental results demonstrate that our proposed model performs better than the state-of-the-art AVSE and traditional audio-only SE models. In summary, our results confirm the effectiveness of our proposed model for the AVSS task with proper fine-tuning strategies, demonstrating that multi-modal self-supervised embeddings obtained from AV-HuBERT can be generalized to audio-visual regression tasks.
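To make the described pipeline concrete, below is a minimal PyTorch sketch of the architecture the abstract outlines: a pre-trained audio-visual encoder producing fused embeddings, followed by a downstream SE module that predicts a spectral mask for the noisy mixture. This is an illustrative reconstruction, not the authors' code: `AVHuBERTStub` stands in for the actual pre-trained AV-HuBERT checkpoint (distributed via fairseq), and all layer sizes, feature dimensions, and names (`SEHead`, `n_freq`, `hidden`) are assumptions chosen for a self-contained example.

```python
# Hedged sketch: frozen or fine-tuned AV-HuBERT-style embeddings feeding an SE head
# that estimates a sigmoid mask over the noisy magnitude spectrogram.
import torch
import torch.nn as nn


class AVHuBERTStub(nn.Module):
    """Stand-in for the pre-trained AV-HuBERT encoder (a fairseq checkpoint in practice).

    Maps time-aligned audio and lip-video feature streams to fused embeddings.
    """

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Toy fusion layer; the real model uses modality-specific front-ends
        # and a transformer backbone.
        self.proj = nn.Linear(104 + 512, embed_dim)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T, 104), video_feats: (B, T, 512), synchronized in time
        return self.proj(torch.cat([audio_feats, video_feats], dim=-1))  # (B, T, D)


class SEHead(nn.Module):
    """Downstream SE module: BLSTM over embeddings, sigmoid mask applied to |STFT|."""

    def __init__(self, embed_dim: int = 768, n_freq: int = 257, hidden: int = 256):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, emb: torch.Tensor, noisy_mag: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(emb)            # (B, T, 2 * hidden)
        return self.mask(h) * noisy_mag   # masked magnitude estimate, (B, T, F)


if __name__ == "__main__":
    B, T = 2, 100
    encoder, se = AVHuBERTStub(), SEHead()
    audio = torch.randn(B, T, 104)        # e.g., stacked log filterbank features
    video = torch.randn(B, T, 512)        # e.g., lip-ROI CNN features
    noisy_mag = torch.rand(B, T, 257)     # magnitude spectrogram of the noisy mixture
    enhanced_mag = se(encoder(audio, video), noisy_mag)
    print(enhanced_mag.shape)             # torch.Size([2, 100, 257])
```

In this setup, the "fine-tuning strategies" the abstract refers to would correspond to choices such as keeping the encoder frozen versus updating it jointly with the SE head; the sketch above leaves both options open since it trains end to end by default.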

Authors (8)
  1. I-Chun Chern (5 papers)
  2. Kuo-Hsuan Hung (22 papers)
  3. Yi-Ting Chen (53 papers)
  4. Tassadaq Hussain (9 papers)
  5. Mandar Gogate (21 papers)
  6. Amir Hussain (75 papers)
  7. Yu Tsao (199 papers)
  8. Jen-Cheng Hou (7 papers)
Citations (13)