End-to-end Audio-visual Speech Recognition with Conformers

Published 12 Feb 2021 in cs.CV and eess.AS | arXiv:2102.06657v1

Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers and then fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training, instead of using pre-computed visual features which is common in the literature, the use of a conformer, instead of a recurrent network, and the use of a transformer-based LLM, significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3), respectively. The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.

Citations (199)

Summary

  • The paper presents an end-to-end AVSR model integrating Conformers with a hybrid CTC/Attention mechanism, significantly reducing word error rates.
  • The model employs a ResNet-based encoder and a Conformer back-end to fuse audio and visual features effectively.
  • Empirical results on LRS2 and LRS3 datasets demonstrate enhanced robustness in noisy conditions and performance improvements over state-of-the-art methods.

End-to-End Audio-Visual Speech Recognition with Conformers

This work takes an end-to-end approach to Audio-Visual Speech Recognition (AVSR), leveraging the expressiveness of Conformers combined with a hybrid Connectionist Temporal Classification (CTC) and attention mechanism. It builds on previous models by integrating feature extraction and recognition into a single deep learning framework, with the aim of improving the understanding and transcription of speech across varied noise conditions.
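The hybrid CTC/Attention objective can be sketched as a weighted sum of the two losses. A minimal illustration follows; the weight value `alpha = 0.1` is an assumption for illustration, not a figure taken from the paper.

```python
def hybrid_loss(ctc_loss: float, attention_loss: float, alpha: float = 0.1) -> float:
    """Hybrid CTC/attention objective: L = alpha * L_ctc + (1 - alpha) * L_att.

    In training, `ctc_loss` and `attention_loss` would be the per-batch CTC
    and attention-decoder (cross-entropy) losses; `alpha` balances the two.
    """
    return alpha * ctc_loss + (1.0 - alpha) * attention_loss


# Example: equal weighting of a CTC loss of 2.0 and an attention loss of 1.0
print(hybrid_loss(2.0, 1.0, alpha=0.5))  # 1.5
```

In practice the CTC branch encourages monotonic alignment while the attention branch provides flexible sequence modeling; the weighted sum lets the model benefit from both.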

Model Architecture and Training Approach

The core model architecture comprises a ResNet-based encoder for extracting acoustic and visual features from raw input data, followed by a Conformer-based back-end, and concluding with a Multi-Layer Perceptron (MLP) for feature fusion. The adoption of Conformers—a convolution-augmented transformer variant—permits sophisticated temporal modeling. The integration of a transformer-based LLM further bolsters the sequence prediction capabilities of the hybrid CTC/Attention model. Through rigorous ablation studies, this integrated approach has demonstrated marked improvements in word error rate (WER) for the visual-only, audio-only, and audio-visual settings over prior state-of-the-art methods, as evidenced by results on the LRS2 and LRS3 datasets.
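The fusion stage described above can be sketched as follows. This is a minimal NumPy illustration under assumed, illustrative dimensions, not the paper's exact implementation: per-frame audio and visual encoder outputs are concatenated and passed through a one-hidden-layer MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: T frames, 256-d audio/visual features, 256-d fused output
T, d_a, d_v, d_out = 50, 256, 256, 256
audio_feats  = rng.standard_normal((T, d_a))   # stand-in for Conformer audio output
visual_feats = rng.standard_normal((T, d_v))   # stand-in for Conformer visual output

# One-hidden-layer MLP fusion (weights random here, learned in practice)
W1 = rng.standard_normal((d_a + d_v, 512)) * 0.01
b1 = np.zeros(512)
W2 = rng.standard_normal((512, d_out)) * 0.01
b2 = np.zeros(d_out)

fused_in = np.concatenate([audio_feats, visual_feats], axis=-1)  # (T, d_a + d_v)
hidden   = np.maximum(fused_in @ W1 + b1, 0.0)                   # ReLU hidden layer
fused    = hidden @ W2 + b2                                      # (T, d_out)
print(fused.shape)  # (50, 256)
```

The key point is that fusion happens on temporally aligned per-frame features, so the decoder downstream sees a single fused sequence rather than two separate modality streams.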

Empirical Results

Empirical evaluations on the challenging LRS2 and LRS3 datasets indicate that the proposed model considerably surpasses existing benchmarks. Specifically, under both clean and noisy audio conditions, the audio-only model, trained on raw waveforms, competes effectively with models that use pre-computed log-Mel filter-bank features. Strikingly, the audio-visual model achieves even greater robustness, particularly in high-noise environments, underscoring the advantage of multi-modal integration. The reported absolute WER reductions suggest a promising direction for AVSR tasks under less-than-ideal circumstances.
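Word error rate, the metric reported throughout, is the word-level edit distance (substitutions, insertions, and deletions) divided by the reference length. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the cat sat", "the cat sat"))  # 0.0
```

A single substituted word in a three-word reference yields a WER of 1/3, which is the granularity at which the paper's absolute WER reductions are measured.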

Implications and Future Directions

This work marks a significant advance in AVSR systems by reducing dependence on pre-computed input features, thereby simplifying the pipeline and improving adaptability under varying conditions. The findings reinforce the potential of feeding raw data directly to the model, which both broadens the applicability of AVSR models in real-world scenarios and lays the groundwork for further exploration of adaptive modality fusion based on environmental noise levels. Future research might explore dynamic weighting schemes for modality integration and extend these methodologies to other languages or dialects. Such efforts could reshape traditional speech recognition paradigms and enhance human-computer interaction in complex auditory environments.

In essence, this work opens new avenues for research and application in audio-visual machine learning frameworks by demonstrating the efficacy of Conformers in conjunction with end-to-end training methodologies. The results emphasize the value of jointly optimizing feature extraction and recognition, pointing towards refined and more inclusive AI-driven communication technologies.
