MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition (2401.03424v3)
Abstract: While automatic speech recognition (ASR) systems degrade significantly in noisy environments, audio-visual speech recognition (AVSR) systems complement the audio stream with noise-invariant visual cues to improve robustness. However, current studies mainly focus on fusing well-learned modality features, such as the outputs of modality-specific encoders, without considering the contextual relationship between modalities during feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes the representation learning of each modality by fusing them at different levels of the audio and visual encoders. Experimental results on the MISP2022-AVSR Challenge dataset show the efficacy of the proposed system, which achieves a concatenated minimum-permutation character error rate (cpCER) of 30.57% on the Eval set, a relative improvement of up to 3.17% over our previous system, which ranked second in the challenge. After fusing multiple systems, our approach surpasses the first-place system and establishes a new state-of-the-art cpCER of 29.13% on this dataset.
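To make the fusion idea concrete, below is a minimal PyTorch sketch of cross-attention fusion applied at multiple intermediate encoder layers, rather than only on the final encoder outputs. The class names (`CrossAttentionFusion`, `MLCAEncoder`), layer counts, dimensions, and fusion placement are illustrative assumptions, not the authors' exact architecture.

```python
# A minimal sketch of multi-layer cross-attention fusion (hypothetical names;
# layer counts and fusion points are illustrative, not the paper's config).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Refine one stream with cross-modal context from the other stream."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Residual cross-attention: queries from the primary stream,
        # keys/values from the other modality.
        fused, _ = self.attn(query=primary, key=other, value=other)
        return self.norm(primary + fused)


class MLCAEncoder(nn.Module):
    """Two encoder stacks fused at intermediate layers, not only at the top."""

    def __init__(self, dim: int = 256, num_layers: int = 6, fuse_every: int = 2):
        super().__init__()

        def make_layer() -> nn.TransformerEncoderLayer:
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=4, dim_feedforward=1024, batch_first=True
            )

        self.audio_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
        self.video_layers = nn.ModuleList([make_layer() for _ in range(num_layers)])
        num_fusions = num_layers // fuse_every
        self.audio_fuse = nn.ModuleList(
            [CrossAttentionFusion(dim) for _ in range(num_fusions)]
        )
        self.video_fuse = nn.ModuleList(
            [CrossAttentionFusion(dim) for _ in range(num_fusions)]
        )
        self.fuse_every = fuse_every

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        fuse_idx = 0
        for i, (a_layer, v_layer) in enumerate(
            zip(self.audio_layers, self.video_layers)
        ):
            audio, video = a_layer(audio), v_layer(video)
            # Inject cross-modal context at intermediate levels of both encoders.
            if (i + 1) % self.fuse_every == 0 and fuse_idx < len(self.audio_fuse):
                audio_new = self.audio_fuse[fuse_idx](audio, video)  # audio -> video
                video_new = self.video_fuse[fuse_idx](video, audio)  # video -> audio
                audio, video = audio_new, video_new
                fuse_idx += 1
        return audio, video


if __name__ == "__main__":
    # Toy shapes: batch=2, 50 audio frames / 25 video frames, model dim=256.
    enc = MLCAEncoder()
    a_out, v_out = enc(torch.randn(2, 50, 256), torch.randn(2, 25, 256))
    print(a_out.shape, v_out.shape)  # [2, 50, 256] and [2, 25, 256]
```

The sketch covers only the encoder-side fusion described in the abstract; the recognition back-end and the multi-system fusion that yields the 29.13% cpCER are omitted.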
- He Wang
- Pengcheng Guo
- Pan Zhou
- Lei Xie