AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations (2302.06419v2)

Published 10 Feb 2023 in eess.AS, cs.AI, and cs.CL

Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
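The training scheme the abstract describes, masking the fused audio-visual input and regressing contextualized targets produced by a momentum (EMA) copy of the shared encoder, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names, feature dimensions, additive fusion, and the choice of averaging the teacher's top-K layer outputs are assumptions modeled on the data2vec recipe the paper builds on.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAVEncoder(nn.Module):
    """Shared transformer encoder over fused audio-visual frames."""
    def __init__(self, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.ModuleList(
            [copy.deepcopy(layer) for _ in range(n_layers)])

    def forward(self, x):
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)
        return x, hidden  # final output plus all per-layer states

class AVData2VecSketch(nn.Module):
    # Dimensions and hyperparameters below are illustrative assumptions,
    # not values taken from the paper.
    def __init__(self, audio_dim=80, video_dim=512, dim=256,
                 top_k=3, ema_decay=0.999):
        super().__init__()
        # Modality-specific projections into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.video_proj = nn.Linear(video_dim, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))
        self.student = SharedAVEncoder(dim=dim)
        # EMA teacher: a frozen copy of the student, updated after each step.
        self.teacher = copy.deepcopy(self.student)
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.head = nn.Linear(dim, dim)
        self.top_k = top_k
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        # Exponential moving average of student weights.
        for ps, pt in zip(self.student.parameters(),
                          self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1 - self.ema_decay)

    def forward(self, audio, video, mask):
        # audio: (B, T, audio_dim); video: (B, T, video_dim);
        # mask: (B, T) boolean, True = masked time step.
        # Simple additive fusion of time-synchronized streams (an assumption).
        x = self.audio_proj(audio) + self.video_proj(video)

        # Teacher sees the unmasked input; the contextualized target
        # averages the top-K transformer layer outputs.
        with torch.no_grad():
            _, hidden = self.teacher(x)
            target = torch.stack(hidden[-self.top_k:]).mean(dim=0)

        # Student sees the masked input and regresses the targets.
        x_masked = torch.where(mask.unsqueeze(-1),
                               self.mask_emb.expand_as(x), x)
        out, _ = self.student(x_masked)
        pred = self.head(out)

        # Regression loss over masked positions only.
        return F.mse_loss(pred[mask], target[mask])

# Usage with dummy data:
model = AVData2VecSketch()
audio = torch.randn(2, 50, 80)    # (batch, time, mel bins)
video = torch.randn(2, 50, 512)   # (batch, time, visual features)
mask = torch.rand(2, 50) < 0.3    # mask ~30% of time steps
loss = model(audio, video, mask)
loss.backward()
model.update_teacher()
```

Because the targets come from the teacher's hidden layers rather than from discrete cluster assignments, the whole pipeline is end-to-end, and because one encoder serves both streams, the learned representations are joint across modalities; these are the two properties the abstract highlights.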

