Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper (arXiv:2309.08535v2)
Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages with limited amounts of labeled data. Unlike previous methods that improve VSR performance for a target language by transferring knowledge learned from other languages, we explore whether we can increase the amount of training data itself for different languages without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition: it filters data in the desired languages and transcribes labels from an unannotated, multilingual audio-visual data pool. By comparing VSR models trained on automatic labels with those trained on human-annotated labels, we show that VSR performance comparable to that obtained with human-annotated labels can be achieved without any human annotation. Through this automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low-resource VSR languages: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in all four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
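To make the labeling pipeline concrete, below is a minimal sketch of the two-stage process the abstract describes (language identification, then audio-based transcription), written against the open-source openai-whisper package. The model size (`large-v2`), the probability threshold, and the helper name `auto_label` are illustrative assumptions for this sketch, not the authors' exact configuration.

```python
# Sketch of the automatic labeling pipeline: Whisper first identifies the
# spoken language, clips outside the target languages are filtered out,
# and the remaining audio is transcribed to produce VSR training labels.
# Requires: pip install openai-whisper
from typing import Optional, Tuple

import whisper

TARGET_LANGS = {"fr", "it", "es", "pt"}  # French, Italian, Spanish, Portuguese
LANG_PROB_THRESHOLD = 0.5                # assumed filtering threshold

model = whisper.load_model("large-v2")   # assumed model size

def auto_label(audio_path: str) -> Optional[Tuple[str, str]]:
    """Return (language, transcript) if the clip is in a target language."""
    # Stage 1: language identification on a padded/trimmed 30-second window.
    audio = whisper.load_audio(audio_path)
    segment = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(segment).to(model.device)
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)

    # Filter: keep only confident predictions in the desired languages.
    if lang not in TARGET_LANGS or probs[lang] < LANG_PROB_THRESHOLD:
        return None

    # Stage 2: audio-based transcription becomes the automatic label.
    result = model.transcribe(audio_path, language=lang)
    return lang, result["text"].strip()
```

Running `auto_label` over an unlabeled pool such as VoxCeleb2 or AVSpeech would yield (language, transcript) pairs for the retained clips, which can then serve as training targets for the VSR model in place of human annotations.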
Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro