- The paper proposes an AVDCNN model that fuses audio and visual streams to enhance speech quality in noisy settings.
- It uses a dual-task deep CNN architecture in which separate audio and visual streams are processed independently and then fused in a joint network that reconstructs both the enhanced speech and the visual input.
- Experimental results demonstrate that the model outperforms conventional baselines on objective speech quality and intelligibility metrics.
Overview of Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks
The paper "Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks" introduces an innovative approach to speech enhancement (SE) by integrating audio and visual information. This research is grounded upon the premise that incorporating multimodal data can effectively improve the intelligibility and quality of speech signals in noisy environments. The authors propose an audio-visual deep convolutional neural network (AVDCNN) model, which utilizes convolutional neural networks (CNNs) to process and fuse both audio and visual streams, thus achieving superior performance in SE compared to conventional audio-only systems.
Proposed Methodology
The AVDCNN model is designed as an audio-visual encoder-decoder network. Initially, separate CNNs process the audio and visual data before fusing them into a joint network. The network's primary task is to enhance speech, while the secondary task involves reconstructing corresponding visual outputs. This dual-task learning enables the model to leverage visual features to aid speech enhancement, a novel approach not extensively explored in previous studies.
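To make the dual-branch, dual-output structure concrete, the following is a minimal PyTorch sketch of an AVDCNN-style network. The layer counts, channel widths, kernel sizes, and feature dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an AVDCNN-style dual-branch, dual-output network.
# All layer sizes below are assumed for illustration only.
import torch
import torch.nn as nn

class AVDCNNSketch(nn.Module):
    def __init__(self, n_freq_bins=257, img_size=32):
        super().__init__()
        # Audio branch: 1-D convolutions over a noisy log-power spectrogram frame.
        self.audio_branch = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(64), nn.Flatten(),      # -> 32 * 64 = 2048 features
        )
        # Visual branch: 2-D convolutions over a cropped mouth-region image.
        self.visual_branch = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),  # -> 32 * 16 = 512 features
        )
        fused_dim = 32 * 64 + 32 * 16
        self.fusion = nn.Sequential(nn.Linear(fused_dim, 512), nn.ReLU())
        # Primary head: enhanced speech spectrum; secondary head: visual reconstruction.
        self.audio_head = nn.Linear(512, n_freq_bins)
        self.visual_head = nn.Linear(512, img_size * img_size)

    def forward(self, noisy_spec, mouth_img):
        a = self.audio_branch(noisy_spec)   # (B, 1, n_freq_bins) -> (B, 2048)
        v = self.visual_branch(mouth_img)   # (B, 1, H, W)        -> (B, 512)
        h = self.fusion(torch.cat([a, v], dim=1))
        return self.audio_head(h), self.visual_head(h)
```

The key structural point is that the two modalities are encoded separately and only merged in the fused layers, from which both output heads are predicted.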
The architecture is trained end-to-end with back-propagation, with parameters optimized to minimize the error in both the audio and visual reconstructions. The model is notably effective against non-stationary noise, which conventional SE techniques struggle to handle.
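A joint objective in the spirit of this dual-task setup can be written as a weighted sum of the two reconstruction errors. The weighting factor and the use of plain mean-squared error below are assumptions for illustration, not the paper's stated loss.

```python
# Assumed joint training objective: weighted sum of audio and visual MSE.
import torch.nn.functional as F

def joint_loss(pred_spec, clean_spec, pred_img, clean_img, alpha=0.7):
    audio_loss = F.mse_loss(pred_spec, clean_spec)   # primary task: speech enhancement
    visual_loss = F.mse_loss(pred_img, clean_img)    # secondary task: visual reconstruction
    return alpha * audio_loss + (1.0 - alpha) * visual_loss
```

Because both terms share the fused layers, gradients from the visual reconstruction act as a regularizer that encourages the network to retain lip-related cues useful for enhancement.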
Experimental Setup and Results
The dataset consists of video recordings of Mandarin sentences, with visual features taken from the speaker's mouth region. The authors employ various noise conditions in their experiments to simulate real-world scenarios, evaluating performance with objective measures such as the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), speech distortion index (SDI), and the hearing-aid speech quality and perception indices (HASQI and HASPI).
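For readers who want to reproduce such scores, the snippet below shows how PESQ and STOI are commonly computed with the third-party `pesq` and `pystoi` packages. The paper does not specify its implementation, so this workflow is an assumption; the remaining measures (SDI, HASQI, HASPI) require their own reference implementations.

```python
# Assumed evaluation workflow using the `pesq` and `pystoi` packages
# (pip install pesq pystoi). Signals are 1-D numpy arrays at 16 kHz.
import numpy as np
from pesq import pesq
from pystoi import stoi

def evaluate(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000) -> dict:
    return {
        "PESQ": pesq(fs, clean, enhanced, "wb"),            # wide-band PESQ
        "STOI": stoi(clean, enhanced, fs, extended=False),  # intelligibility score in [0, 1]
    }
```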
The AVDCNN model consistently outperformed baseline models, including the conventional KLT and logMMSE techniques, as well as an audio-visual SE model based on fully connected deep neural networks (AVDNN). In particular, the gains over an audio-only CNN model underline the benefit of incorporating visual information.
The research further explores different fusion schemes, showing that late fusion, where audio and visual data are processed separately before integration, yields better results than early fusion. Additionally, the effectiveness of multi-task learning is emphasized, demonstrating its advantages over single-task approaches in handling complex SE tasks.
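The structural difference between the two fusion schemes can be summarized in code. The sketch below uses fully connected layers as stand-ins for the convolutional branches; module names and dimensions are hypothetical.

```python
# Conceptual contrast between early and late fusion (dimensions are placeholders).
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Concatenate raw audio and visual features first, then run one shared network."""
    def __init__(self, audio_dim, visual_dim, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(audio_dim + visual_dim, hidden), nn.ReLU())

    def forward(self, a, v):
        return self.shared(torch.cat([a, v], dim=1))

class LateFusion(nn.Module):
    """Give each modality its own sub-network, then merge the learned embeddings."""
    def __init__(self, audio_dim, visual_dim, hidden=512):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.merge = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())

    def forward(self, a, v):
        return self.merge(torch.cat([self.audio_net(a), self.visual_net(v)], dim=1))
```

The late-fusion variant lets each modality learn its own representation before integration, which is consistent with the paper's finding that it outperforms early fusion.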
Implications and Future Directions
The incorporation of visual information into the SE process offers significant theoretical and practical implications. From a theoretical standpoint, it underscores the potential of multimodal deep learning in improving SE outcomes. Practically, the AVDCNN model could enhance applications such as automatic speech and speaker recognition systems, hearing aids, and in-car voice command systems, where background noise is a prevalent challenge.
Future research directions include expanding the model to use entire facial images instead of only the mouth region, exploiting advanced CNN architectures like fully convolutional networks and U-Net, and improving synchronization between audio and visual streams. Further exploration into real-world application scenarios is warranted to refine the model's effectiveness under diverse conditions.
In summary, this research makes a substantial contribution to the field of speech enhancement by effectively integrating audio-visual data using deep learning frameworks, laying the groundwork for future advancements in multimodal speech processing technologies.