
Modality Attention for End-to-End Audio-visual Speech Recognition (1811.05250v2)

Published 13 Nov 2018 in cs.CL, cs.CV, cs.SD, and eess.AS

Abstract: Audio-visual speech recognition (AVSR) is considered one of the most promising approaches to robust speech recognition, especially in noisy environments. In this paper, we propose a novel multimodal attention-based method for audio-visual speech recognition that automatically learns a fused representation from both modalities according to their importance. Our method is realized using state-of-the-art sequence-to-sequence (Seq2seq) architectures. Experimental results show relative improvements of 2% to 36% over the auditory modality alone, depending on the signal-to-noise ratio (SNR). Compared to traditional feature-concatenation methods, our proposed approach achieves better recognition performance under both clean and noisy conditions. We believe the modality-attention-based end-to-end method can be easily generalized to other multimodal tasks with correlated information.
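The core idea in the abstract, fusing audio and visual features with learned per-frame importance weights, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-modality scoring parameters (`w`, `b`) and the softmax-over-modalities formulation are assumptions about how such a modality-attention layer is typically built.

```python
import numpy as np

rng = np.random.default_rng(0)

def modality_attention(audio_feat, visual_feat, w, b):
    """Fuse frame-level audio and visual features with scalar attention
    weights (a softmax over the two modalities at each frame).

    audio_feat, visual_feat: (T, D) frame-level features.
    w: (2, D) scoring vectors, b: (2,) biases -- hypothetical parameters
    that would be learned jointly with the Seq2seq model.
    Returns the fused (T, D) representation.
    """
    # Score each modality per frame: e_m[t] = w_m . feat_m[t] + b_m
    scores = np.stack([audio_feat @ w[0] + b[0],
                       visual_feat @ w[1] + b[1]], axis=1)   # (T, 2)
    # Softmax over the modality axis yields importance weights summing to 1,
    # so a noisy audio stream can be down-weighted in favor of the visual one.
    ex = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = ex / ex.sum(axis=1, keepdims=True)               # (T, 2)
    # Weighted sum of the two modality streams.
    return alpha[:, :1] * audio_feat + alpha[:, 1:] * visual_feat

T, D = 5, 8
audio = rng.normal(size=(T, D))
visual = rng.normal(size=(T, D))
w = rng.normal(size=(2, D))
b = np.zeros(2)
fused = modality_attention(audio, visual, w, b)
print(fused.shape)  # (5, 8)
```

Unlike plain feature concatenation, which fixes the contribution of each stream, the attention weights here vary per frame, which is what lets the fusion adapt as the SNR of the audio changes.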

Authors (5)
  1. Pan Zhou (220 papers)
  2. Wenwen Yang (4 papers)
  3. Wei Chen (1288 papers)
  4. Yanfeng Wang (211 papers)
  5. Jia Jia (59 papers)
Citations (66)