Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation (2401.09802v2)

Published 18 Jan 2024 in eess.AS, cs.CV, and cs.SD

Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy: processing with visual speech units. Motivated by the recent success of audio speech units, we propose a visual speech unit obtained by discretizing the visual speech features extracted from a self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. Using the visual speech units as the inputs of our system, we pre-train a VSR model to predict corresponding text outputs on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, training efficiency is greatly improved compared to standard VSR training; specifically, the input data size is reduced to 0.016% of the original video inputs. To complement the insufficient visual information in speech recognition, we apply curriculum learning, where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is fine-tuned on continuous features. We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models with a single trained model.
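The key mechanism in the abstract is discretizing continuous self-supervised visual speech features into a small vocabulary of units. Below is a minimal sketch, not the authors' implementation, of how such units could be produced with k-means clustering and then shortened by collapsing consecutive duplicates; the random feature array, cluster count, and function names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): quantizing frame-level
# visual speech features into discrete "visual speech units" via k-means.
import numpy as np
from sklearn.cluster import KMeans


def quantize_to_units(features: np.ndarray, n_units: int = 200) -> np.ndarray:
    """Cluster frame-level features and return one unit index per frame."""
    kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0)
    return kmeans.fit_predict(features)


def deduplicate(units: np.ndarray) -> np.ndarray:
    """Collapse consecutive repeated units, shortening the input sequence."""
    keep = np.ones(len(units), dtype=bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]


if __name__ == "__main__":
    # Stand-in for features from a self-supervised visual speech encoder:
    # shape (num_frames, feature_dim).
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(300, 768)).astype(np.float32)

    units = deduplicate(quantize_to_units(feats, n_units=50))
    print(f"{len(feats)} feature frames -> {len(units)} discrete units")
```

Replacing raw video frames with short sequences of integer unit indices, rather than pixel tensors, is what makes the large reduction in input size reported in the abstract possible.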

Authors (5)
  1. Minsu Kim (115 papers)
  2. Jeong Hun Yeo (12 papers)
  3. Se Jin Park (15 papers)
  4. Yong Man Ro (90 papers)
  5. Hyeongseop Rha (6 papers)
Citations (2)