SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition (2401.09759v2)

Published 18 Jan 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall short in evaluating the image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with texts on the slides on the presentation recordings. As technical terminologies that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.


Summary

  • The paper presents SlideAVSR, a dataset that leverages visual slide cues to boost transcription accuracy for technical speech.
  • It details the data collection pipeline, which mines paper explanation videos from YouTube and filters them with tools such as ChatGPT and BLIP-2 for quality assurance.
  • The study also introduces DocWhisper, a Whisper-based ASR model augmented with OCR and an FQ Ranker, which significantly improves recognition of complex scientific terminology.

Introduction to SlideAVSR

In the field of artificial intelligence, multimodal models, which process different types of data such as language, images, video, and audio, are becoming increasingly capable. A growing subfield within this domain is audio-visual speech recognition (AVSR), which enhances automatic speech recognition (ASR) by supplementing audio with video. Traditional AVSR work focuses mainly on improving the recognition of spoken words through lip-reading. This paper takes a different approach by introducing the SlideAVSR dataset, which emphasizes audio-visual comprehension of speech in the context of scientific paper explanation videos.

Creating the SlideAVSR Dataset

The authors of the paper have created SlideAVSR to address the challenge of transcribing technical terminologies commonly found in academic presentations. These terms are difficult for standard ASR to recognize without additional textual references. The dataset was constructed by mining paper explanation videos from YouTube that include visual data in the form of presentation slides, which are often paired with spoken technical jargon. This creates a unique context for AVSR systems to incorporate visual cues from text to accurately recognize speech.

SlideAVSR fills a gap in existing AVSR datasets, which are predominantly centered on lip-reading. A methodical process was used to collect and refine the data, ensuring quality and relevance: videos had to meet criteria such as duration, format, and the availability of manual subtitles. Filtering relied on automatic tools, with ChatGPT used to analyze video descriptions and BLIP-2 used to confirm the presence of slides in frames sampled from the videos, as sketched below.
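
The exact filtering scripts are not included in this summary, but the BLIP-2 step can be approximated with an off-the-shelf checkpoint. Below is a minimal sketch assuming the Hugging Face `Salesforce/blip2-opt-2.7b` model and a hypothetical yes/no prompt; the authors' actual prompt and decision rule may differ.

```python
# Minimal sketch: use BLIP-2 visual question answering to check whether a
# sampled video frame shows a presentation slide. Model choice and prompt
# wording are assumptions, not the authors' exact filter.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def frame_shows_slide(frame_path: str) -> bool:
    """Return True if BLIP-2 answers that the frame contains a slide."""
    image = Image.open(frame_path).convert("RGB")
    prompt = "Question: Does this image show a presentation slide? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        "cuda", torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=5)
    answer = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return "yes" in answer.lower()

# Example: keep a video only if most sampled frames look like slides.
# frames = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]
# keep = sum(frame_shows_slide(f) for f in frames) > len(frames) // 2
```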

Enhancing AVSR with DocWhisper

To complement the introduction of SlideAVSR, the authors propose DocWhisper, a modified version of the existing ASR model Whisper. DocWhisper integrates optical character recognition (OCR) to utilize textual information from slides, helping it capture technical terminology that audio alone might miss; a rough approximation is sketched below.
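
One simple way to give Whisper access to slide text is to pass OCR output through its prompt mechanism. The sketch below is a rough approximation, not the authors' DocWhisper implementation: it assumes the open-source `openai-whisper` and `pytesseract` packages and a naive prompt built from raw OCR output. Because Whisper's prompt window is limited, choosing which OCR words to include matters, which is the role of the FQ Ranker described next.

```python
# Minimal sketch: extract slide text with OCR and feed it to Whisper as a
# prompt, so rare technical terms are available during decoding. Library
# choices (openai-whisper, pytesseract) and the prompt format are
# illustrative assumptions, not the authors' DocWhisper implementation.
import whisper
import pytesseract
from PIL import Image

def transcribe_with_slide_text(audio_path: str, slide_image_path: str) -> str:
    # OCR the slide image into raw text.
    slide_text = pytesseract.image_to_string(Image.open(slide_image_path))
    keywords = " ".join(slide_text.split())  # collapse whitespace

    # Whisper's initial_prompt biases decoding toward the supplied words.
    model = whisper.load_model("large-v2")
    result = model.transcribe(audio_path, initial_prompt=keywords)
    return result["text"]

# Example (hypothetical paths):
# print(transcribe_with_slide_text("talk_segment.wav", "slide_03.png"))
```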

The newly proposed FQ Ranker works alongside DocWhisper, sorting OCR results by word frequency and prioritizing rarer, presumably harder-to-recognize terms; a sketch follows below. Experiments show that DocWhisper, particularly when combined with FQ Ranker, significantly improves performance on the SlideAVSR dataset, including on speech with accents that have historically been challenging for ASR systems.
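
This summary does not name the frequency resource behind FQ Ranker, so the sketch below uses the `wordfreq` package as an assumed stand-in, ranking OCR tokens from rarest to most common so that the rarest terms are kept when only a limited number of words fit into the prompt.

```python
# Minimal sketch of a frequency-based ranker: keep the rarest OCR words,
# which are the ones an ASR model is least likely to get right from audio
# alone. The wordfreq package is an assumed stand-in for the paper's
# frequency data.
from wordfreq import zipf_frequency

def rank_ocr_words(ocr_text: str, top_k: int = 50) -> list[str]:
    """Return up to top_k unique OCR words, rarest first."""
    words = {w.strip(".,:;()[]") for w in ocr_text.split() if w.isascii()}
    words = {w for w in words if len(w) > 2}
    # zipf_frequency returns ~0 for unseen words and ~7 for very common ones,
    # so ascending order puts the rarest (most informative) words first.
    ranked = sorted(words, key=lambda w: zipf_frequency(w.lower(), "en"))
    return ranked[:top_k]

# Example: prioritize rare terms from a slide before building the prompt.
# rank_ocr_words("the softmax attention of Transformer and BLEU perplexity")
# -> rare terms like "BLEU", "softmax", "perplexity" come before "the", "and"
```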

Looking Forward

The findings underscore the potential for considerable improvements in AVSR systems when they effectively utilize visual context. This is a meaningful step forward for applications that involve complex linguistic content, such as academia or technical industries. Future research directions include refining OCR-based methods, constructing end-to-end AVSR models, and expanding benchmarks to assess AVSR models' image comprehension over a wider range of video types.

The creation of SlideAVSR marks an important contribution to the AVSR field, enabling more accurate models capable of handling speech in a broader context than simply lip-reading. Such advancements are expected to give rise to more sophisticated AVSR systems that will better understand and transcribe spoken language across varying settings.

Limitations and Ethical Considerations

It is noted that SlideAVSR currently includes a limited number of speakers and thus may not be representative of all demographics; the categorization of speakers' accents is also a work in progress with room for refinement. The authors affirm their commitment to ethical research, stating that they release only non-sensitive data and restrict its usage to research applications, thereby respecting individuals' privacy and adhering to YouTube's data policies.