BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition (2404.02098v1)
Abstract: Self-supervision has recently shown great promise for learning visual and auditory speech representations from unlabelled data. In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods in various settings. Moreover, we observe favourable scaling behaviour when increasing the amount of unlabelled data well beyond that used in other self-supervised works. In particular, we achieve 20.0% / 1.7% word error rate for VSR / ASR on the LRS3 test set, with only 30 hours of labelled data and no external ASR models. Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
- Alexandros Haliassos
- Andreas Zinonos
- Rodrigo Mira
- Stavros Petridis
- Maja Pantic