BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition (2404.02098v1)

Published 2 Apr 2024 in cs.CV

Abstract: Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour by increasing the amount of unlabelled data well beyond other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
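The headline numbers (20.0% for VSR, 1.7% for ASR on LRS3) are word error rates. WER is the standard metric for speech recognition: the word-level Levenshtein distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the metric (not the paper's evaluation code; the function name and interface are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis with one substituted word out of four scores 0.25; a WER of 1.7% means roughly one error per 59 reference words.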

Authors (5)
  1. Alexandros Haliassos
  2. Andreas Zinonos
  3. Rodrigo Mira
  4. Stavros Petridis
  5. Maja Pantic