XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception (2403.14402v2)

Published 21 Mar 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Speech recognition and translation systems perform poorly on noisy inputs, which are frequent in realistic environments. Augmenting these systems with visual signals has the potential to improve robustness to noise. However, audio-visual (AV) data is only available in limited amounts and for fewer languages than audio-only resources. To address this gap, we present XLAVS-R, a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation in over 100 languages. It is designed to maximize the benefits of limited multilingual AV pre-training data, by building on top of audio-only multilingual pre-training and simplifying existing pre-training schemes. Extensive evaluation on the MuAViC benchmark shows the strength of XLAVS-R on downstream audio-visual speech recognition and translation tasks, where it outperforms the previous state of the art by up to 18.5% WER and 4.7 BLEU given noisy AV inputs, and enables strong zero-shot audio-visual ability with audio-only fine-tuning.
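The headline gains are reported in WER (word error rate) for recognition and BLEU for translation. As a point of reference for how the WER figure is computed, below is a minimal sketch of word-level edit distance normalized by reference length; this is the standard metric definition, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,           # deletion
                dp[i][j - 1] + 1,           # insertion
                dp[i - 1][j - 1] + sub_cost # substitution / match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

An absolute improvement of "up to 18.5% WER" under noisy audio-visual input therefore means 18.5 fewer word errors per 100 reference words relative to the prior state of the art.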

Authors (7)
  1. HyoJung Han
  2. Mohamed Anwar
  3. Juan Pino
  4. Wei-Ning Hsu
  5. Marine Carpuat
  6. Bowen Shi
  7. Changhan Wang
Citations (4)