Deep Multimodal Learning for Audio-Visual Speech Recognition (1501.05396v1)

Published 22 Jan 2015 in cs.CL and cs.LG

Abstract: In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of $41\%$ under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of $35.83\%$ demonstrating the tremendous value of the visual channel in phone classification even in audio with high signal to noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of $34.03\%$.

Citations (216)

Summary

  • The paper presents a novel multimodal fusion approach that combines separate deep networks for audio and visual inputs into a joint feature space.
  • It introduces a bilinear softmax layer that captures class-specific correlations between modalities; combining its posteriors with the fused model's reduces the PER to 34.03%, down from 41% with audio alone.
  • The findings demonstrate that integrating visual cues significantly improves phone classification even under clean, high-SNR audio conditions.

Deep Multi-Modal Learning for Audio-Visual Speech Recognition

The paper under consideration presents advancements in the field of Audio-Visual Automatic Speech Recognition (AV-ASR) by exploring deep multi-modal learning techniques to effectively fuse auditory and visual inputs. Given that human speech perception is inherently a multi-modal process involving auditory signals and visual cues such as lip movements, the paper seeks to mimic this integration in automated systems. The researchers introduce both a fusion model and a novel architecture to improve phoneme classification by considering both audio and visual modalities.

Core Methodological Contributions

The research is distinguished by its introduction of two principal methods for modality fusion. The initial methodology involves training separate deep networks for audio and visual channels. These networks’ final hidden layers are then concatenated to form a joint feature space where another deep network is subsequently trained. This approach illustrates the beneficial impact of incorporating visual signals even in acoustic conditions characterized by high signal-to-noise ratios, achieving a phone error rate (PER) of 35.83% compared to 41% with audio alone.
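
As a rough illustration of this feature-fusion scheme, the pipeline can be sketched in PyTorch along the following lines. The layer widths, activations, feature dimensions, and training details here are assumptions for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the paper's actual audio/visual feature dimensions,
# layer widths, and number of phone classes differ.
AUDIO_DIM, VISUAL_DIM, HIDDEN, N_PHONES = 360, 100, 1024, 42

def mlp(in_dim, hidden, out_dim, n_hidden=3):
    """Plain feed-forward stack ending in a linear classifier layer."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.Sigmoid()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Step 1: uni-modal networks, each trained separately on phone targets.
audio_net  = mlp(AUDIO_DIM,  HIDDEN, N_PHONES)
visual_net = mlp(VISUAL_DIM, HIDDEN, N_PHONES)
# ... train audio_net and visual_net independently with cross-entropy ...

# Step 2: drop each network's output layer, concatenate the final hidden
# activations, and train a new network on this joint feature space.
audio_trunk  = nn.Sequential(*list(audio_net.children())[:-1])
visual_trunk = nn.Sequential(*list(visual_net.children())[:-1])
fusion_net   = mlp(2 * HIDDEN, HIDDEN, N_PHONES)

def fused_logits(audio_x, visual_x):
    with torch.no_grad():  # uni-modal trunks kept frozen in this sketch
        h = torch.cat([audio_trunk(audio_x), visual_trunk(visual_x)], dim=-1)
    return fusion_net(h)   # phone logits computed from the joint feature space
```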

Building upon this, the authors propose a novel deep network architecture employing a bilinear softmax layer, which captures class-specific correlations between the modalities. This bilinear architecture allows the audio and visual inputs to be scored jointly and trained with backpropagation, giving a more nuanced integration of the two channels. Impressively, by combining the posteriors from the bilinear network with those from the fused model, they achieve a further reduced PER of 34.03%.
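
The sketch below shows only the general form of a class-specific bilinear scoring layer and a simple posterior-averaging rule; the paper's exact parameterization and its precise rule for combining posteriors may differ, and all names and dimensions here are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class BilinearSoftmax(nn.Module):
    """For each phone class k, the logit is h_a^T W_k h_v + b_k, so the
    layer models audio-visual correlations specific to that class."""
    def __init__(self, audio_dim, visual_dim, n_classes):
        super().__init__()
        # nn.Bilinear keeps one weight matrix per output class.
        self.bilinear = nn.Bilinear(audio_dim, visual_dim, n_classes)

    def forward(self, h_audio, h_visual):
        return self.bilinear(h_audio, h_visual)  # class logits, softmaxed downstream

# Combining posteriors from the bilinear network and the fused model;
# a simple weighted average of the two distributions is one plausible rule.
def combine_posteriors(logits_bilinear, logits_fused, weight=0.5):
    p_b = F.softmax(logits_bilinear, dim=-1)
    p_f = F.softmax(logits_fused, dim=-1)
    return weight * p_b + (1.0 - weight) * p_f
```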

Numerical Results and Analysis

The performance improvements are quantitatively significant. Fusing the audio and visual data reduces the PER from 41.25% with audio alone to 34.03% with the combined system. Notably, these gains are measured under clean, high-SNR audio, underscoring the non-trivial contribution of the visual channel to speech recognition quality even when the acoustic signal is not degraded.
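
Expressed as a relative improvement (a derived figure, not one quoted in the paper), this corresponds to a relative PER reduction of roughly $\frac{41.25 - 34.03}{41.25} \approx 17.5\%$.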

Implications and Speculative Outlook

The implications of this research are manifold, encompassing practical enhancements in AV-ASR technologies and theoretical advancements in multi-modal deep learning. Practically, integrating visual cues can be expected to improve the accuracy of speech recognition systems, particularly in environments affected by acoustic noise or overlapping speakers. Theoretically, the bilinear model introduced here opens avenues for further exploration of joint representation learning and multi-modal data fusion. The discriminative approach aligns with recent trends in leveraging class-specific interactions within multi-modal frameworks.

Looking to the future, the findings could influence the development of more robust AV-ASR systems capable of operating seamlessly in diverse application settings such as telecommunications, accessibility technologies, and assistive devices for the hearing impaired. Given the rapid advancements in computational capabilities and deep learning frameworks, ensuing research may focus on refining these models by exploring additional modalities, handling larger vocabulary sizes, or optimizing architectures for real-time processing.

In sum, this paper makes a noteworthy contribution to the audio-visual speech recognition domain through innovative multi-modal learning strategies and a focus on class-specific modality correlations. Further investigations will undoubtedly continue to bridge the gap between human-like perception and automated speech recognition systems.