Conformers are All You Need for Visual Speech Recognition (2302.10915v2)
Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, a visual front-end with a limited temporal receptive field processes the raw pixels depicting the lips or faces. At the higher level, an encoder attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder yields lower latency, more efficient memory usage, and better word error rate (WER). We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the LRS3-TED dataset, rivaling the performance of audio-only models from just four years ago.
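The abstract's core architectural claim is easy to express in code: the "visual front-end" collapses to a single linear projection of each flattened video frame, and all of the modeling capacity sits in a large Conformer encoder attending over the whole utterance. Below is a minimal sketch of that idea, not the authors' implementation. It assumes PyTorch with `torchaudio.models.Conformer` standing in for the paper's encoder; the class name `LinearFrontEndVSR` and all sizes (64x64 grayscale lip crops, 512-dim embeddings, 16 layers) are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn
import torchaudio


class LinearFrontEndVSR(nn.Module):
    """Sketch: linear visual front-end + Conformer encoder (sizes illustrative)."""

    def __init__(self, frame_h=64, frame_w=64, d_model=512, num_layers=16):
        super().__init__()
        # "Visual front-end": one linear layer mapping raw pixels -> embeddings,
        # replacing the deep convolutional/transformer front-ends of prior work.
        self.front_end = nn.Linear(frame_h * frame_w, d_model)
        # Large Conformer encoder with a wide temporal receptive field.
        self.encoder = torchaudio.models.Conformer(
            input_dim=d_model,
            num_heads=8,
            ffn_dim=2048,
            num_layers=num_layers,
            depthwise_conv_kernel_size=31,
        )

    def forward(self, frames, lengths):
        # frames: (batch, time, H, W) grayscale lip crops; lengths: (batch,).
        b, t = frames.shape[:2]
        x = self.front_end(frames.reshape(b, t, -1))  # (B, T, d_model)
        return self.encoder(x, lengths)


# Usage: a batch of two clips, 75 frames of 64x64 pixels each.
model = LinearFrontEndVSR()
frames = torch.randn(2, 75, 64, 64)
lengths = torch.tensor([75, 75])
out, out_lengths = model(frames, lengths)
print(out.shape)  # torch.Size([2, 75, 512])
```

The point of this arrangement is that the only per-pixel computation is one matrix multiply per frame; the budget saved on the front-end is reallocated to encoder depth and width, which is what the paper credits for the latency, memory, and WER gains.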