Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models (2312.13873v1)
Abstract: In recent years, automatic speech recognition (ASR) has reached a level of accuracy that even outperforms humans in transcribing speech to text. Nevertheless, all current ASR approaches remain vulnerable to ambient noise. To reduce this weakness, audio-visual speech recognition (AVSR) approaches additionally consider visual information from lip movements for transcription. This additional modality increases the computational cost of training models from scratch. We propose an approach that builds on a pre-trained ASR model and extends it with an adaptive upstream module that fuses audio and visual information. Since we do not need to train the transformer structure from scratch, our approach requires a fraction of the computational resources of traditional AVSR models. Compared to current SOTA systems such as AV-HuBERT, our approach achieves an average improvement of 8.3% in word error rate across different model sizes, noise categories, and a broad SNR range. The approach allows up to 21% smaller models and requires only a fraction of the computational resources for training and inference compared to common AVSR approaches.
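The abstract describes an adaptive upstream module that fuses audio and visual features before a pre-trained (and therefore untouched) ASR transformer. A minimal sketch of one plausible gated-fusion design is below; the class name, gating formula, and feature dimensions are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class AdaptiveAVFusion:
    """Hypothetical gated additive fusion: out = a + g * (v @ W_v),
    with gate g = sigmoid([a; v] @ W_g), so the output keeps the
    audio feature dimension expected by the frozen ASR encoder."""

    def __init__(self, d_audio, d_visual):
        # Small random weights stand in for trained parameters.
        self.W_v = rng.standard_normal((d_visual, d_audio)) * 0.02  # visual -> audio space
        self.W_g = rng.standard_normal((d_audio + d_visual, d_audio)) * 0.02  # gate weights

    def __call__(self, a, v):
        # a: (T, d_audio) audio frames; v: (T, d_visual) time-aligned lip frames
        v_proj = v @ self.W_v
        gate = sigmoid(np.concatenate([a, v], axis=-1) @ self.W_g)
        return a + gate * v_proj  # same shape as the audio stream

fusion = AdaptiveAVFusion(d_audio=768, d_visual=256)
a = rng.standard_normal((50, 768))   # e.g. 50 audio feature frames
v = rng.standard_normal((50, 256))   # matching visual feature frames
fused = fusion(a, v)
print(fused.shape)  # (50, 768)
```

Because the fused output has the same shape as the audio features, it can be passed to the downstream pre-trained ASR model unchanged, which is what lets the approach avoid retraining the transformer from scratch.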
- “Super-human performance in online low-latency recognition of conversational speech,” in Interspeech, 2020.
- “Robust speech recognition via large-scale weak supervision,” 2022.
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020, NIPS’20, Curran Associates Inc.
- “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech, 2020, pp. 5036–5040.
- “Spatio-temporal fusion based convolutional sequence learning for lip reading,” 2019, pp. 713–722.
- “Relaxed attention for transformer models,” in 2023 International Joint Conference on Neural Networks (IJCNN), 2023, pp. 1–10.
- “Audio-visual speech enhancement with a deep kalman filter generative model,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- “My lips are concealed: Audio-visual speech enhancement through obstructions,” in Interspeech, 2019.
- “Audio-visual speech enhancement and separation by utilizing multi-modal self-supervised embeddings,” in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023, pp. 1–5.
- “VisualVoice: Audio-visual speech separation with cross-modal consistency,” 2021, pp. 15490–15500.
- “Vocoder-based speech synthesis from silent videos,” in Interspeech, 2020.
- “Lip2AudSpec: Speech reconstruction from silent lip movements video,” 2018, pp. 2516–2520.
- “Cross-modal audio-visual co-learning for text-independent speaker verification,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- “Lip reading in profile,” in British Machine Vision Conference, 2017.
- “Auxiliary loss multimodal GRU model in audio-visual speech recognition,” IEEE Access, 2018.
- “Recurrent neural network transducer for audio-visual speech recognition,” 2019.
- “Discriminative multi-modality speech recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- “Jointly learning visual and auditory speech representations from raw data,” in The Eleventh International Conference on Learning Representations, 2023.
- “End-to-end audio-visual speech recognition with conformers,” 2021, pp. 7613–7617.
- “Auto-AVSR: Audio-visual speech recognition with automatic labels,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- “Learning audio-visual speech representation by masked multimodal cluster prediction,” in Proceedings of the 10th International Conference on Learning Representations (ICLR), 2022.
- “Robust Self-Supervised Audio-Visual Speech Recognition,” in Proc. Interspeech 2022, 2022, pp. 2118–2122.
- “LRS3-TED: A large-scale dataset for visual speech recognition,” 2018.
- “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
- “Whisper-AT: Noise-robust automatic speech recognizers are also strong audio event taggers,” in Proc. Interspeech 2023, 2023.
- “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Interspeech, 2019.
- Christopher Simic
- Tobias Bocklet