Conformers are All You Need for Visual Speech Recognition (2302.10915v2)

Published 17 Feb 2023 in cs.LG, cs.CL, cs.CV, cs.SD, and eess.AS

Abstract: Visual speech recognition models extract visual features in a hierarchical manner. At the lower level, there is a visual front-end with a limited temporal receptive field that processes the raw pixels depicting the lips or faces. At the higher level, there is an encoder that attends to the embeddings produced by the front-end over a large temporal receptive field. Previous work has focused on improving the visual front-end of the model to extract more useful features for speech recognition. Surprisingly, our work shows that complex visual front-ends are not necessary. Instead of allocating resources to a sophisticated visual front-end, we find that a linear visual front-end paired with a larger Conformer encoder results in lower latency, more efficient memory usage, and improved WER performance. We achieve a new state-of-the-art of 12.8% WER for visual speech recognition on the TED LRS3 dataset, which rivals the performance of audio-only models from just four years ago.
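The paper page itself provides no code, but the idea the abstract describes, replacing a deep convolutional visual front-end with a single linear projection and spending the saved capacity on the Conformer encoder, can be sketched compactly. The following is a minimal, illustrative PyTorch sketch, not the authors' implementation: the crop size (64x64 grayscale), model width (256), head count (4), and depth (16 layers) are assumptions for demonstration, and the Conformer block follows the published recipe of Gulati et al. (2020) in simplified form (standard rather than relative-position attention).

```python
# Minimal sketch (not the authors' code): a linear visual front-end feeding a
# Conformer encoder. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class LinearFrontEnd(nn.Module):
    """Projects flattened lip-crop pixels directly to encoder embeddings."""
    def __init__(self, frame_pixels: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(frame_pixels, d_model)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, height * width * channels)
        return self.proj(frames)

class ConformerBlock(nn.Module):
    """Convolution-augmented Transformer block (Gulati et al., 2020), simplified."""
    def __init__(self, d_model: int, n_heads: int, conv_kernel: int = 31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1), nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model), nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1))
        self.ff2 = nn.Sequential(nn.LayerNorm(d_model),
                                 nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm_out = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)                      # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.norm_conv(x).transpose(1, 2)          # (B, D, T) for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)                      # second half-step FF
        return self.norm_out(x)

# Illustrative usage: 64x64 grayscale lip crops, 16 Conformer layers (assumed).
frontend = LinearFrontEnd(frame_pixels=64 * 64, d_model=256)
encoder = nn.Sequential(*[ConformerBlock(256, n_heads=4) for _ in range(16)])
video = torch.randn(2, 100, 64 * 64)   # (batch, frames, flattened pixels)
print(encoder(frontend(video)).shape)  # torch.Size([2, 100, 256])
```

The sketch makes the resource trade-off concrete: LinearFrontEnd is a single matrix, so essentially all parameters and compute live in the encoder, which has the full temporal receptive field and is where the paper reports capacity is best spent.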

