AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model (2308.07593v2)

Published 15 Aug 2023 in cs.CV, cs.MM, eess.AS, and eess.IV

Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because lip movements carry insufficient speech information. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality by using the audio modality. Different from previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of the audio knowledge in a compact audio memory by discarding non-linguistic information from the audio through quantization, and 3) includes an Audio Bridging Module that finds the best-matched audio features from the compact audio memory, which makes training possible without audio inputs once the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS3 dataset.
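To make the abstract's mechanism concrete, the sketch below illustrates the general idea: quantized features from a pretrained audio model are stored in a fixed "compact audio memory", and a bridging module lets visual features attend over that memory so that no audio input is needed at inference. This is a minimal, illustrative sketch rather than the authors' implementation; the module names, dimensions, and the use of multi-head cross-attention as the retrieval step are assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of a compact audio memory
# plus an audio bridging module that retrieves audio knowledge from visual queries.
import torch
import torch.nn as nn


class CompactAudioMemory(nn.Module):
    """Holds a fixed set of quantized audio embeddings (the 'compact audio memory')."""

    def __init__(self, num_entries: int = 200, dim: int = 256):
        super().__init__()
        # In the paper this memory would be filled by quantizing features from a
        # pretrained audio model (e.g., HuBERT); here it is a learnable placeholder.
        self.memory = nn.Parameter(torch.randn(num_entries, dim))


class AudioBridgingModule(nn.Module):
    """Cross-attention from visual features (queries) to the audio memory (keys/values)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, time, dim); memory: (entries, dim)
        mem = memory.unsqueeze(0).expand(visual_feats.size(0), -1, -1)
        retrieved, _ = self.attn(query=visual_feats, key=mem, value=mem)
        # Fuse the retrieved audio knowledge back into the visual stream.
        return visual_feats + retrieved


if __name__ == "__main__":
    memory = CompactAudioMemory()
    bridge = AudioBridgingModule()
    video_features = torch.randn(2, 75, 256)  # e.g., 75 lip-movement frames per clip
    fused = bridge(video_features, memory.memory)
    print(fused.shape)  # torch.Size([2, 75, 256])
```

Because the memory is fixed after it is composed, inference uses only the silent video: the visual features query the stored audio knowledge instead of a live audio stream.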

Authors (5)
  1. Jeong Hun Yeo (12 papers)
  2. Minsu Kim (115 papers)
  3. Jeongsoo Choi (22 papers)
  4. Dae Hoe Kim (1 paper)
  5. Yong Man Ro (90 papers)
Citations (13)