ViLaS: Exploring the Effects of Vision and Language Context in Automatic Speech Recognition (2305.19972v2)
Abstract: Enhancing automatic speech recognition (ASR) performance by leveraging additional multimodal information has shown promising results in previous studies. However, most of these works have focused primarily on visual cues derived from human lip motion. In fact, context-dependent visual and linguistic cues can also benefit ASR in many scenarios. In this paper, we first propose ViLaS (Vision and Language into Automatic Speech Recognition), a novel multimodal ASR model based on the continuous integrate-and-fire (CIF) mechanism, which can integrate visual and textual context simultaneously or separately to facilitate speech recognition. Next, we introduce an effective training strategy that improves performance in modal-incomplete test scenarios. Then, to explore the effects of integrating vision and language, we create VSDial, a multimodal ASR dataset with multimodal context cues in both Chinese and English versions. Finally, we report empirical results on the public Flickr8K dataset and the self-constructed VSDial dataset, explore various cross-modal fusion schemes, analyze fine-grained cross-modal alignment on VSDial, and provide insights into the effects of integrating multimodal information into speech recognition.
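To make the fusion idea concrete, below is a minimal sketch (not the authors' implementation) of how token-level acoustic embeddings produced by a CIF module could attend to optional visual and textual context before decoding. The module names, dimensions, and cross-attention wiring here are illustrative assumptions; the only point carried over from the abstract is that visual and textual context can be integrated simultaneously, separately, or not at all, mirroring the modal-incomplete test scenario.

```python
# Illustrative sketch only: cross-modal fusion of CIF acoustic embeddings with
# optional visual and textual context. Names and shapes are assumptions, not
# the ViLaS architecture as published.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses CIF-integrated acoustic embeddings with context embeddings."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, cif_emb, visual_ctx=None, text_ctx=None):
        # cif_emb:    (B, U, D) token-level acoustic embeddings from CIF
        # visual_ctx: (B, Nv, D) image features (e.g., ViT patches), optional
        # text_ctx:   (B, Nt, D) contextual text features (e.g., BERT), optional
        fused = cif_emb
        if visual_ctx is not None:
            v, _ = self.visual_attn(query=fused, key=visual_ctx, value=visual_ctx)
            fused = fused + v  # residual add of visually attended context
        if text_ctx is not None:
            t, _ = self.text_attn(query=fused, key=text_ctx, value=text_ctx)
            fused = fused + t  # residual add of textually attended context
        return self.norm(fused)


if __name__ == "__main__":
    B, U, Nv, Nt, D = 2, 10, 49, 16, 256
    fusion = CrossModalFusion(d_model=D)
    cif = torch.randn(B, U, D)
    img = torch.randn(B, Nv, D)
    txt = torch.randn(B, Nt, D)
    out_full = fusion(cif, visual_ctx=img, text_ctx=txt)  # both contexts present
    out_audio = fusion(cif)                               # modal-incomplete case
    print(out_full.shape, out_audio.shape)                # (2, 10, 256) for both
```

Keeping a separate attention block per modality, with residual connections back to the acoustic stream, is one simple way to let the audio-only path remain intact when a context stream is missing at test time.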