
End-to-end Audio-visual Speech Recognition with Conformers (2102.06657v1)

Published 12 Feb 2021 in cs.CV and eess.AS

Abstract: In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms, respectively, which are then fed to conformers and then fusion takes place via a Multi-Layer Perceptron (MLP). The model learns to recognise characters using a combination of CTC and an attention mechanism. We show that end-to-end training, instead of using pre-computed visual features which is common in the literature, the use of a conformer, instead of a recurrent network, and the use of a transformer-based language model, significantly improve the performance of our model. We present results on the largest publicly available datasets for sentence-level speech recognition, Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3), respectively. The results show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.

End-to-End Audio-Visual Speech Recognition with Conformers

The research presents an end-to-end approach to Audio-Visual Speech Recognition (AVSR) that combines the modeling capacity of Conformers with a hybrid Connectionist Temporal Classification (CTC)/attention objective. It builds on previous models by integrating feature extraction and recognition into a single deep learning framework, with the aim of improving the transcription of speech across varied noise conditions.
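
Concretely, hybrid CTC/attention training optimizes a weighted combination of the two objectives; the interpolation weight $\lambda$ below is a generic hyperparameter of this family of models, not a value reported in this summary:

$$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{att}}, \qquad \lambda \in [0,1].$$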

Model Architecture and Training Approach

The core architecture comprises ResNet-based front-ends that extract visual and acoustic features directly from raw pixels and raw waveforms, Conformer-based back-ends for each modality, and a Multi-Layer Perceptron (MLP) that fuses the two feature streams. The adoption of Conformers, a convolution-augmented transformer variant, permits sophisticated temporal modeling. A transformer-based language model further bolsters the sequence prediction capabilities of the hybrid CTC/attention model. Through rigorous ablation studies, this integrated approach has demonstrated marked improvements in word error rate (WER) for the visual-only, audio-only, and audio-visual settings over prior state-of-the-art methods, as evidenced by results on the LRS2 and LRS3 datasets.
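
The following is a minimal PyTorch sketch of this pipeline. The dimensions, module names, the strided 1D-convolution audio front-end, the use of torchaudio's Conformer, and the CTC-only output head are illustrative assumptions for brevity, not the authors' exact implementation.

```python
# Hedged sketch of an end-to-end AVSR model: ResNet visual front-end,
# 1D-conv audio front-end, per-modality Conformer back-ends, MLP fusion,
# and a character-level CTC head (the attention decoder is omitted).
import torch
import torch.nn as nn
from torchaudio.models import Conformer
from torchvision.models import resnet18


class AVSRSketch(nn.Module):
    def __init__(self, d_model: int = 256, num_classes: int = 40):
        super().__init__()
        # Visual front-end: ResNet-18 over grayscale mouth-region crops,
        # producing one d_model-dim vector per video frame.
        vis = resnet18(num_classes=d_model)
        vis.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.visual_frontend = vis
        # Audio front-end: strided 1D convolution over the raw waveform,
        # chosen so its output frame rate matches the 25 fps video.
        self.audio_frontend = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=80, stride=640),
            nn.ReLU(),
        )

        # Per-modality Conformer back-ends (torchaudio's implementation).
        def make_backend() -> Conformer:
            return Conformer(input_dim=d_model, num_heads=4, ffn_dim=1024,
                             num_layers=6, depthwise_conv_kernel_size=31)

        self.visual_backend = make_backend()
        self.audio_backend = make_backend()
        # MLP fusion of the two per-frame streams, followed by the CTC head.
        self.fusion = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.ctc_head = nn.Linear(d_model, num_classes)

    def forward(self, video, audio, lengths):
        # video: (B, T, 1, H, W) mouth crops; audio: (B, samples) raw waveform.
        b, t = video.shape[:2]
        v = self.visual_frontend(video.flatten(0, 1)).view(b, t, -1)
        a = self.audio_frontend(audio.unsqueeze(1)).transpose(1, 2)
        t = min(v.size(1), a.size(1))                      # align the two frame rates
        v, _ = self.visual_backend(v[:, :t], lengths.clamp(max=t))
        a, _ = self.audio_backend(a[:, :t], lengths.clamp(max=t))
        fused = self.fusion(torch.cat([v, a], dim=-1))     # (B, T, d_model)
        return self.ctc_head(fused).log_softmax(dim=-1)    # per-frame CTC log-probs


model = AVSRSketch()
video = torch.randn(2, 50, 1, 88, 88)          # 2 clips, 50 frames of 88x88 crops
audio = torch.randn(2, 50 * 640)               # raw 16 kHz waveform for the same clips
log_probs = model(video, audio, torch.tensor([50, 50]))
print(log_probs.shape)                         # torch.Size([2, 50, 40])
```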

Empirical Results

Empirical evaluations on the challenging LRS2 and LRS3 datasets indicate that the proposed model considerably surpasses existing benchmarks. Under both clean and noisy audio conditions, the audio-only model, trained directly on raw waveforms, is competitive with models that rely on pre-computed log-Mel filter-bank features. Strikingly, the audio-visual model is even more robust, particularly in high-noise environments, underscoring the advantage of multi-modal integration. The reported absolute WER reductions suggest a promising direction for AVSR under adverse acoustic conditions.
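
For context, the "noisy" evaluation conditions in this line of work are typically constructed by adding noise to the clean waveform at a target signal-to-noise ratio. The helper below is a generic sketch of that procedure, not code or settings taken from the paper.

```python
# Generic sketch: mix additive noise into clean speech at a target SNR (in dB).
import torch


def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: speech.numel()]                       # crop noise to utterance length
    p_speech = speech.pow(2).mean()
    p_noise = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


clean = torch.randn(16000)                                # 1 s of 16 kHz "speech"
noisy_0db = mix_at_snr(clean, torch.randn(16000), snr_db=0.0)
```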

Implications and Future Directions

This work marks a significant advance for AVSR systems by reducing the dependency on pre-computed input features, thereby simplifying the pipeline and improving adaptability under varying conditions. The findings reinforce the potential of learning directly from raw inputs, which not only broadens the applicability of AVSR models in real-world scenarios but also lays the groundwork for adaptive modality fusion driven by environmental noise levels. Future research might explore dynamic weighting schemes for modality integration and extend these methods to other languages and dialects. Such efforts could reshape traditional speech recognition paradigms and enhance human-computer interaction in complex auditory environments.

In essence, this work opens new avenues for research and application in audio-visual machine learning by demonstrating the efficacy of Conformers in conjunction with end-to-end training. The results emphasize the value of jointly optimizing feature extraction and recognition, pointing towards more robust and accessible AI-driven communication technologies.

Authors (3)
  1. Pingchuan Ma (90 papers)
  2. Stavros Petridis (64 papers)
  3. Maja Pantic (100 papers)
Citations (199)