SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition (2401.09759v2)
Abstract: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. Considerable effort in AVSR has gone into datasets centered on facial features such as lip reading, but these datasets fall short of evaluating image comprehension in broader visual contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of text on the slides shown in the presentation recordings. Because the technical terminology that is frequent in paper explanations is notoriously difficult to transcribe without reference text, SlideAVSR spotlights a new facet of the AVSR problem. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information on the slides, and confirm its effectiveness on SlideAVSR.
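The abstract does not spell out DocWhisper's architecture, but a common way to let a Whisper-style model "refer to" slide text is to OCR the slide and inject the extracted terms through Whisper's prompt interface. The sketch below illustrates that idea only; the function name `transcribe_with_slide_text`, the use of `pytesseract`, and prompt-based conditioning are illustrative assumptions, not the paper's implementation.

```python
# A minimal, hypothetical sketch of a slide-aware Whisper baseline.
# Assumptions (not from the paper): the `openai-whisper` and `pytesseract`
# packages, and that slide text is injected via Whisper's `initial_prompt`.
import whisper                 # pip install openai-whisper
import pytesseract             # pip install pytesseract (needs the Tesseract binary)
from PIL import Image


def transcribe_with_slide_text(audio_path: str, slide_png: str) -> str:
    """Transcribe audio, biasing decoding with OCR'd text from one slide frame."""
    # 1. OCR the slide frame to recover the technical terms shown on screen.
    slide_text = " ".join(pytesseract.image_to_string(Image.open(slide_png)).split())

    # 2. Whisper only keeps the tail of a long prompt (roughly 224 tokens),
    #    so a short keyword string works better than a full slide dump.
    model = whisper.load_model("small")
    result = model.transcribe(audio_path, initial_prompt=slide_text[:1000])
    return result["text"]
```

Prompt-based conditioning of this kind needs no fine-tuning, which makes it a natural first baseline: rare terms that appear on the slide become likelier spellings during decoding, at the cost of depending on OCR quality.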