AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations (2302.06419v2)
Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges by building audio-visual representations through the prediction of contextualized target representations, an approach that has proven successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine the two modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods across all settings given the same amount of data and model size.
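The training recipe follows data2vec: a teacher encoder, maintained as an exponential moving average (EMA) of the student, produces contextualized targets from the unmasked input by averaging its top transformer layers, and the student regresses these targets at masked positions. Below is a minimal PyTorch sketch of this teacher-student loop with a shared encoder over audio and video; all module names, the summation-based fusion, the zero-masking, and every hyperparameter are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of data2vec-style pre-training with a shared audio-visual
# encoder. Module names, fusion-by-summation, zero-masking and all
# hyperparameters are illustrative assumptions, not the paper's exact setup.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """Modality-specific front-ends followed by a shared transformer."""
    def __init__(self, dim=256, layers=6, heads=4):
        super().__init__()
        # Hypothetical lightweight front-ends (AV-HuBERT-style systems use a
        # ResNet for video and filterbank features for audio).
        self.audio_frontend = nn.Linear(80, dim)   # e.g. 80-dim log-mel frames
        self.video_frontend = nn.Linear(512, dim)  # e.g. pooled visual features
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.layers = nn.ModuleList(copy.deepcopy(block) for _ in range(layers))

    def forward(self, audio=None, video=None):
        # Fuse modalities by summation when both are present (one simple choice).
        x = 0
        if audio is not None:
            x = x + self.audio_frontend(audio)
        if video is not None:
            x = x + self.video_frontend(video)
        hidden = []
        for layer in self.layers:
            x = layer(x)
            hidden.append(x)
        return hidden  # per-layer outputs; the teacher averages the top ones

def ema_update(teacher, student, decay=0.999):
    # The teacher tracks the student via an exponential moving average.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1 - decay)

def train_step(student, teacher, audio, video, mask_prob=0.5, top_k=4):
    batch, frames, _ = audio.shape
    # 1) Teacher: contextualized targets from the *unmasked* input, averaging
    #    the top-k layers (simplified; data2vec also normalizes each layer).
    with torch.no_grad():
        target = torch.stack(teacher(audio=audio, video=video)[-top_k:]).mean(0)
    # 2) Student: mask a random subset of time steps by zeroing the inputs
    #    (learned mask embeddings are the more common choice in practice).
    mask = torch.rand(batch, frames) < mask_prob
    pred = student(audio=audio.masked_fill(mask.unsqueeze(-1), 0.0),
                   video=video.masked_fill(mask.unsqueeze(-1), 0.0))[-1]
    # 3) Regress the teacher targets at masked positions only.
    loss = F.mse_loss(pred[mask], target[mask])
    loss.backward()          # optimizer step omitted for brevity
    ema_update(teacher, student)
    return loss.item()

# Toy usage with random tensors standing in for real features.
student = SharedEncoder()
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
audio = torch.randn(2, 100, 80)   # (batch, frames, mel bins)
video = torch.randn(2, 100, 512)  # (batch, frames, visual features)
print(train_step(student, teacher, audio, video))
```

Averaging several of the teacher's top layers rather than taking only its final output makes the regression targets richer and more stable, which is the core idea behind contextualized target representations.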