Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing (2402.15151v2)
Abstract: In visual speech processing, context modeling is one of the most important capabilities because lip movements are inherently ambiguous. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering their context. In this paper, we propose a novel framework, Visual Speech Processing incorporated with LLMs (VSP-LLM), which maximizes context modeling ability by leveraging the power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped into the input latent space of an LLM through a self-supervised visual speech model. Noting that input frames contain redundant information, we propose a novel deduplication method that reduces the embedded visual features using visual speech units. Through the proposed deduplication and Low-Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates lip movements more effectively than a recent model trained with 433 hours of data.
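The deduplication step described above is concrete enough to sketch: consecutive frames whose features map to the same visual speech unit (a discrete cluster index derived from the self-supervised model's features) are collapsed into a single embedding, shortening the sequence handed to the LLM. The PyTorch snippet below is a minimal sketch of that run-merging step, not the authors' implementation; the tensor names `features` and `units` and the choice to average each run are illustrative assumptions.

```python
import torch

def deduplicate(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Collapse runs of consecutive frames that share the same visual speech unit.

    features: (T, D) frame-level visual features from the self-supervised model.
    units:    (T,)   discrete visual speech unit (cluster index) per frame.
    Returns a (T', D) tensor with T' <= T, one averaged feature per run.
    """
    # Mark positions where the unit changes from the previous frame.
    change = torch.ones_like(units, dtype=torch.bool)
    change[1:] = units[1:] != units[:-1]
    # Assign each frame a run id: e.g. units [5,5,2,2,2,9] -> run_ids [0,0,1,1,1,2].
    run_ids = torch.cumsum(change.long(), dim=0) - 1
    num_runs = int(run_ids[-1]) + 1
    # Average the features within each run (scatter-style mean).
    sums = torch.zeros(num_runs, features.size(1), dtype=features.dtype)
    sums.index_add_(0, run_ids, features)
    counts = torch.bincount(run_ids, minlength=num_runs).unsqueeze(1).to(features.dtype)
    return sums / counts

# Example: 6 frames with units [5, 5, 2, 2, 2, 9] -> 3 deduplicated features.
feats = torch.randn(6, 1024)
units = torch.tensor([5, 5, 2, 2, 2, 9])
print(deduplicate(feats, units).shape)  # torch.Size([3, 1024])
```

Because lip frames change slowly relative to the underlying phonetic content, this merging can substantially reduce the sequence length seen by the LLM; together with LoRA, which updates only low-rank adapter matrices rather than the full LLM, this is where the reported training efficiency comes from.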
- LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496.
- ASR is all you need: Cross-modal distillation for lip reading. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2143–2147. IEEE.
- MuAViC: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation. arXiv preprint arXiv:2303.00628.
- LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Conformers are all you need for visual speech recognition. arXiv preprint arXiv:2302.10915.
- MixSpeech: Cross-modality self-learning with audio-visual stream mixup for visual speech translation and recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15735–15745.
- VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
- Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
- Joon Son Chung and Andrew Zisserman. 2017. Lip reading in the wild. In Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part II 13, pages 87–103. Springer.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014.
- RetinaFace: Single-shot multi-level face localisation in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212.
- QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48.
- Prompting large language models with speech recognition abilities. arXiv preprint arXiv:2307.11795.
- Alex Graves. 2012. Connectionist temporal classification. Supervised sequence labelling with recurrent neural networks, pages 61–93.
- Jointly learning visual and auditory speech representations from raw data. In The Eleventh International Conference on Learning Representations.
- ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
- CroMM-VSR: Cross-modal memory augmented visual speech recognition. IEEE Transactions on Multimedia, 24:4342–4355.
- Distinguishing homophenes using multi-head visual-audio memory for lip reading. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1174–1182.
- On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
- Textless speech-to-speech translation on real data. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 860–872.
- Auto-AVSR: Audio-visual speech recognition with automatic labels. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- Towards practical lipreading with distilled and efficient models. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7608–7612. IEEE.
- End-to-end audio-visual speech recognition with conformers. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7613–7617. IEEE.
- Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 4(11):930–939.
- Training strategies for improved lip-reading. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8472–8476. IEEE.
- Recurrent neural network transducer for audio-visual speech recognition. In 2019 IEEE automatic speech recognition and understanding workshop (ASRU), pages 905–912. IEEE.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
- Comparative layer-wise analysis of self-supervised speech models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- End-to-end visual speech recognition with LSTMs. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2592–2596. IEEE.
- Stavros Petridis and Maja Pantic. 2016. Deep complementary bottleneck features for visual speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2304–2308. IEEE.
- End-to-end audiovisual speech recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6548–6552. IEEE.
- Sub-word level lip reading with visual attention. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 5162–5172.
- Learning from the master: Distilling cross-modal advanced knowledge for lip reading. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13325–13333.
- AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925.
- Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video. arXiv preprint arXiv:2201.10439.
- Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv preprint arXiv:2201.02184.
- Large-scale visual speech recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pages 4135–4139. ISCA.
- Themos Stafylakis and Georgios Tzimiropoulos. 2017. Combining residual networks with LSTMs for lipreading. In Proc. Interspeech.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Attention is all you need. Advances in neural information processing systems, 30.
- BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE.
- NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519.
- AKVSR: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model.
- Multi-temporal lip-audio memory for visual speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- SpeechUT: Bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1663–1676.
- Hearing lips: Improving lip reading by distilling speech recognizers. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6917–6924.
- VATLM: Visual-audio-text pre-training with unified masked prediction for speech representation learning. IEEE Transactions on Multimedia.