Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation (2401.09802v2)
Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. Because massively multilingual modeling of visual data incurs huge computational costs, we propose a novel training strategy: processing with visual speech units. Motivated by the recent success of audio speech units, we obtain visual speech units by discretizing the visual speech features extracted from a self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. Using the visual speech units as inputs to our system, we pre-train a VSR model to predict the corresponding text on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, training efficiency improves greatly over standard VSR training; specifically, the input data size is reduced to 0.016% of the original video inputs. To compensate for the insufficient visual information in speech recognition, we apply curriculum learning in which the inputs of the system begin as audio-visual speech units and gradually change to visual speech units. After pre-training, the model is fine-tuned on continuous features. We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models with a single trained model.
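The pipeline hinges on turning continuous visual speech features into a small discrete vocabulary. Below is a minimal Python sketch of that discretization, assuming a pretrained self-supervised visual speech encoder whose frame-level features are clustered with k-means, as is standard for HuBERT-style speech units; the feature dimension, cluster count, and the random arrays standing in for real encoder outputs are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Frame-level features from a pretrained self-supervised visual speech
# encoder (e.g., an AV-HuBERT-style model). Shapes are illustrative:
# N training frames, D-dimensional features; random data stands in for
# real encoder outputs.
N, D = 10_000, 768
features = np.random.randn(N, D).astype(np.float32)

# Step 1: learn a codebook over the continuous features. The cluster
# count K is an assumption (HuBERT-style unit vocabularies commonly
# range from a few hundred to a few thousand).
K = 1000
kmeans = KMeans(n_clusters=K, n_init=1, random_state=0).fit(features)

# Step 2: discretize one utterance's features into unit IDs.
T = 120  # frames in one video clip
utterance = np.random.randn(T, D).astype(np.float32)
units = kmeans.predict(utterance)  # shape (T,), integers in [0, K)

# Step 3 (optional): merge repeated consecutive units, a common step in
# speech-unit pipelines that further shortens the input sequence.
keep = np.insert(np.diff(units) != 0, 0, True)
print(len(units), "->", int(keep.sum()), "tokens")
```

Because each frame collapses to a single integer ID (and repeats can be merged), the token stream is orders of magnitude smaller than raw video, which is the source of the reported reduction in input data size. The curriculum that eases the model from audio-visual to visual-only inputs can likewise be approximated by a decaying sampling probability; the linear schedule below is an assumption, since the paper only states that inputs change gradually from audio-visual to visual speech units.

```python
import random

def curriculum_input(step, total_steps, av_units, v_units, rng=random):
    """Return the input units for one training example.

    The probability of feeding audio-visual units decays linearly from
    1 to 0 over training, so the model starts with the easier
    audio-visual signal and ends on the target visual-only signal.
    """
    p_av = max(0.0, 1.0 - step / total_steps)
    return av_units if rng.random() < p_av else v_units
```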
Authors: Minsu Kim, Jeong Hun Yeo, Se Jin Park, Yong Man Ro, Hyeongseop Rha