Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition (2306.10563v1)

Published 18 Jun 2023 in eess.AS, cs.CV, cs.MM, and cs.SD

Abstract: Audio-visual speech recognition (AVSR) offers a promising way to improve the noise robustness of audio-only speech recognition using visual information. However, most existing efforts still focus on the audio modality, given its dominance in the AVSR task, applying noise-adaptation techniques such as front-end denoising. Though effective, these methods typically face two practical challenges: 1) a lack of sufficient labeled noisy audio-visual training data in some real-world scenarios, and 2) limited model generalization to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without depending on noisy training data, i.e., unsupervised noise adaptation. Inspired by the human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals and thus enable speech recognition under any noisy conditions. Extensive experiments on the public benchmarks LRS3 and LRS2 show that our approach achieves the state of the art under various noisy as well as clean conditions. In addition, we also outperform previous state-of-the-art methods on the visual speech recognition task.
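The modality-transfer idea in the abstract can be pictured as a retrieval step: keep paired banks of viseme and phoneme prototypes, match each visual frame feature to its nearest viseme prototype, and emit the paired phoneme prototype as a restored clean-audio representation. The sketch below is an illustrative assumption based only on the abstract; the prototype banks (`viseme_bank`, `phoneme_bank`), dimensions, and nearest-neighbor lookup are hypothetical, not the authors' released implementation.

```python
import numpy as np

# Hypothetical sketch of viseme -> phoneme modality transfer, as described
# in the abstract; NOT the authors' released UniVPM code.
# viseme_bank[k] and phoneme_bank[k] are assumed to be paired prototypes
# learned from clean audio-visual data: K entries of dimension D each.
rng = np.random.default_rng(0)
K, D, T = 40, 256, 100                     # prototypes, feature dim, frames
viseme_bank = rng.normal(size=(K, D))      # stand-in for learned visemes
phoneme_bank = rng.normal(size=(K, D))     # stand-in for paired phonemes

def restore_clean_audio(visual_feats: np.ndarray) -> np.ndarray:
    """Map visual frame features (T, D) to restored audio features (T, D).

    For every frame, find the nearest viseme prototype by cosine
    similarity and return its paired phoneme prototype, i.e. a
    noise-free audio representation reconstructed from lip movements.
    """
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    b = viseme_bank / np.linalg.norm(viseme_bank, axis=1, keepdims=True)
    sims = v @ b.T                          # (T, K) cosine similarities
    nearest = sims.argmax(axis=1)           # best-matching viseme per frame
    return phoneme_bank[nearest]            # (T, D) restored audio features

visual_feats = rng.normal(size=(T, D))      # dummy visual front-end output
clean_audio_feats = restore_clean_audio(visual_feats)
print(clean_audio_feats.shape)              # (100, 256)
```

Because the lookup depends only on visual input, the restored audio stream is unaffected by acoustic noise, which is what allows adaptation to arbitrary testing noises without noisy training data.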

Authors (6)
  1. Yuchen Hu (60 papers)
  2. Ruizhe Li (40 papers)
  3. Chen Chen (752 papers)
  4. Chengwei Qin (28 papers)
  5. Qiushi Zhu (11 papers)
  6. Eng Siong Chng (112 papers)
Citations (3)