Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge (2308.09311v2)
Abstract: This paper proposes a novel lip reading framework for low-resource languages, a setting that has received little attention in the previous literature. Because low-resource languages lack enough video-text paired data to train a model that can adequately capture both lip movements and language, developing lip reading models for them is regarded as challenging. To mitigate this challenge, we first learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. Since different languages partially share common phonemes, general speech knowledge learned from one language can be extended to other languages. We then learn language-specific knowledge, the ability to model language, by proposing a Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder stores language-specific audio features in memory banks and can be trained on audio-text paired data, which is far more accessible than video-text paired data. With LMDecoder, the input speech units are transformed into language-specific audio features and translated into text using the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, lip reading models can be developed efficiently even for low-resource languages. The effectiveness of the proposed method is evaluated through extensive experiments on five languages: English, Spanish, French, Italian, and Portuguese.
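To make the two-stage idea concrete, below is a minimal PyTorch sketch of a memory-augmented decoder in the spirit of LMDecoder. It assumes the visual front-end has already been mapped to discrete speech units (the abstract's "general speech knowledge" stage); the unit vocabulary size, memory slot count, and the single transformer decoder layer are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MemoryAugmentedDecoderSketch(nn.Module):
    """Hypothetical sketch of a language-specific memory-augmented decoder.

    Discrete speech units are embedded and used as queries into a learnable
    memory bank of language-specific audio features; the retrieved features
    condition an autoregressive text decoder. All sizes are placeholders.
    """

    def __init__(self, num_units=200, dim=512, memory_slots=112, vocab_size=1000):
        super().__init__()
        self.unit_embed = nn.Embedding(num_units, dim)               # speech-unit embeddings
        self.memory = nn.Parameter(torch.randn(memory_slots, dim))   # language-specific memory bank
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.to_vocab = nn.Linear(dim, vocab_size)

    def forward(self, units, text_embeds):
        # units: (B, T) discrete speech-unit ids
        # text_embeds: (B, L, dim) embedded, shifted target tokens
        q = self.unit_embed(units)                                   # (B, T, dim)
        # Address the memory bank with scaled dot-product attention.
        attn = torch.softmax(q @ self.memory.T / q.size(-1) ** 0.5, dim=-1)
        audio_like = attn @ self.memory                              # retrieved audio-like features
        # Decode text conditioned on the retrieved language-specific features.
        out = self.decoder(text_embeds, audio_like)
        return self.to_vocab(out)                                    # (B, L, vocab_size)
```

Because this decoder consumes only speech units and text, it can in principle be pretrained on abundant audio-text pairs (quantizing the audio into units) and then reused unchanged when the units instead come from a video encoder, which is the data-efficiency argument the abstract makes.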
- Minsu Kim
- Jeong Hun Yeo
- Jeongsoo Choi
- Yong Man Ro