Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper (2309.08535v2)

Published 15 Sep 2023 in cs.CV, cs.AI, and eess.AS

Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages with only a limited amount of labeled data. Unlike previous methods that improve VSR performance for a target language by transferring knowledge learned from other languages, we explore whether the amount of training data itself can be increased for these languages without human intervention. To this end, we employ a Whisper model, which can perform both language identification and audio-based speech recognition: it filters clips in the desired languages out of an unannotated, multilingual audio-visual data pool and transcribes labels for them. By comparing VSR models trained on automatic labels against models trained on human-annotated labels, we show that similar VSR performance can be achieved without any human annotation. Through this automated labeling process, we label the large-scale unlabeled multilingual databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low-resource VSR languages: French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in all four languages, significantly surpassing previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
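
The core of the method is a two-stage automatic-labeling pass over an unlabeled audio-visual pool: Whisper first identifies each clip's spoken language, clips outside the four target languages are discarded, and the remaining clips are transcribed to produce pseudo-labels for VSR training. Below is a minimal sketch of that idea using the open-source openai-whisper package; the clips.txt input list, the JSON output format, and the model size are illustrative assumptions, not the authors' released pipeline.

```python
# Sketch of Whisper-based automatic labeling: language identification to
# filter an unlabeled pool, then transcription to produce pseudo-labels.
# Input list (clips.txt) and output format are hypothetical.
import json
import whisper

TARGET_LANGS = {"fr", "it", "es", "pt"}  # French, Italian, Spanish, Portuguese

model = whisper.load_model("large-v2")

def label_clip(audio_path: str):
    """Return (language, transcript) if the clip is in a target language, else None."""
    # Language identification on the first 30 s of audio.
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, lang_probs = model.detect_language(mel)
    lang = max(lang_probs, key=lang_probs.get)
    if lang not in TARGET_LANGS:
        return None  # filter out clips in other languages
    # Audio-based speech recognition to produce the pseudo-label.
    result = model.transcribe(audio_path, language=lang)
    return lang, result["text"].strip()

# Build pseudo-labels for every audio track listed in clips.txt.
labels = {}
with open("clips.txt") as f:
    for path in map(str.strip, f):
        if not path:
            continue
        out = label_clip(path)
        if out is not None:
            labels[path] = {"language": out[0], "text": out[1]}

with open("auto_labels.json", "w") as f:
    json.dump(labels, f, ensure_ascii=False, indent=2)
```

The retained transcripts then serve as targets for training the VSR model on the corresponding video clips, in place of human annotations.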

References (43)
  1. “End-to-End Speech Recognition: A Survey” In arXiv preprint arXiv:2303.03329, 2023
  2. “Recent developments on ESPnet toolkit boosted by Conformer” In Proc. ICASSP, 2021, pp. 5874–5878 IEEE
  3. Jinyu Li “Recent advances in end-to-end automatic speech recognition” In APSIPA Transactions on Signal and Information Processing 11.1 Now Publishers, Inc., 2022
  4. OpenAI “Introducing ChatGPT”, https://openai.com/blog/chatgpt, 2022
  5. “Deep speech: Scaling up end-to-end speech recognition” In arXiv preprint arXiv:1412.5567, 2014
  6. “Deep speech 2: End-to-end speech recognition in English and Mandarin” In ICML, 2016, pp. 173–182 PMLR
  7. “Hybrid CTC/attention architecture for end-to-end speech recognition” In IEEE Journal of Selected Topics in Signal Processing 11.8 IEEE, 2017, pp. 1240–1253
  8. “Conformer: Convolution-augmented transformer for speech recognition” In Proc. Interspeech, 2020
  9. “Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning” In Proc. Interspeech, 2023
  10. “Lipnet: End-to-end sentence-level lipreading” In arXiv preprint arXiv:1611.01599, 2016
  11. Minsu Kim, Joanna Hong and Yong Man Ro “Lip-to-speech synthesis in the wild with multi-task learning” In Proc. ICASSP, 2023, pp. 1–5 IEEE
  12. Jeongsoo Choi, Minsu Kim and Yong Man Ro “Intelligible Lip-to-Speech Synthesis with Speech Units” In Proc. Interspeech, 2023
  13. “Deep audio-visual speech recognition” In IEEE Transactions on Pattern Analysis and Machine Intelligence 44.12 IEEE, 2018, pp. 8717–8727
  14. “End-to-end audiovisual speech recognition” In Proc. ICASSP, 2018, pp. 6548–6552
  15. Xingxuan Zhang, Feng Cheng and Shilin Wang “Spatio-temporal fusion based convolutional sequence learning for lip reading” In Proc. ICCV, 2019, pp. 713–722
  16. “Hearing lips: Improving lip reading by distilling speech recognizers” In Proc. AAAI 34.04, 2020, pp. 6917–6924
  17. Pingchuan Ma, Stavros Petridis and Maja Pantic “End-to-end audio-visual speech recognition with conformers” In Proc. ICASSP, 2021
  18. Minsu Kim, Jeong Hun Yeo and Yong Man Ro “Distinguishing homophenes using multi-head visual-audio memory for lip reading” In Proc. AAAI 36.1, 2022, pp. 1174–1182
  19. Jeong Hun Yeo, Minsu Kim and Yong Man Ro “Multi-Temporal Lip-Audio Memory for Visual Speech Recognition” In Proc. ICASSP, 2023, pp. 1–5
  20. “AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model” In arXiv preprint arXiv:2308.07593, 2023
  21. “Conformers are All You Need for Visual Speech Recognition” In arXiv preprint arXiv:2302.10915, 2023
  22. Triantafyllos Afouras, Joon Son Chung and Andrew Zisserman “Asr is all you need: Cross-modal distillation for lip reading” In Proc. ICASSP, 2020, pp. 2143–2147
  23. “Visual speech recognition in a driver assistance system” In EUSIPCO, 2022, pp. 1131–1135 IEEE
  24. “Auto-AVSR: Audio-visual speech recognition with automatic labels” In Proc. ICASSP, 2023, pp. 1–5
  25. “Lip reading sentences in the wild” In Proc. CVPR, 2017, pp. 3444–3453
  26. “Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge” In Proc. ICCV, 2023
  27. Joon Son Chung and Andrew Zisserman “Lip reading in the wild” In Proc. ACCV, 2017, pp. 87–103 Springer
  28. Triantafyllos Afouras, Joon Son Chung and Andrew Zisserman “LRS3-TED: a large-scale dataset for visual speech recognition” In arXiv preprint arXiv:1809.00496, 2018
  29. “Multilingual TEDx Corpus for Speech Recognition and Translation” In Proc. Interspeech, 2021
  30. Pingchuan Ma, Stavros Petridis and Maja Pantic “Visual speech recognition for multiple languages in the wild” In Nature Machine Intelligence, Nature Publishing Group UK London, 2022, pp. 1–10
  31. “Learning Cross-Lingual Visual Speech Representations” In Proc. ICASSP, 2023, pp. 1–5
  32. Joon Son Chung, Arsha Nagrani and Andrew Zisserman “VoxCeleb2: Deep speaker recognition” In Proc. Interspeech, 2018 International Speech Communication Association
  33. “Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation” In ACM Transactions on Graphics (TOG) 37.4 ACM New York, NY, USA, 2018, pp. 1–11
  34. “Language identification: A tutorial” In IEEE Circuits and Systems Magazine 11.2 IEEE, 2011, pp. 82–108
  35. “Robust speech recognition via large-scale weak supervision” In Proc. ICML, 2023, pp. 28492–28518
  36. “Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction” In Proc. ICLR, 2021
  37. “Deep residual learning for image recognition” In Proc. CVPR, 2016, pp. 770–778
  38. “Attention is all you need” In NIPS 30, 2017
  39. “CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French” In Proc. EMNLP, 2020, p. 1801 NIH Public Access
  40. “Retinaface: Single-shot multi-level face localisation in the wild” In Proc. CVPR, 2020, pp. 5203–5212
  41. “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing” In arXiv preprint arXiv:1808.06226, 2018
  42. “Decoupled Weight Decay Regularization” In Proc. ICLR, 2018
  43. Diederik P Kingma and Jimmy Ba “Adam: A method for stochastic optimization” In Proc. ICLR, 2015
Authors (4)
  1. Jeong Hun Yeo (12 papers)
  2. Minsu Kim (115 papers)
  3. Shinji Watanabe (416 papers)
  4. Yong Man Ro (90 papers)
Citations (6)