SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition (2401.09759v2)

Published 18 Jan 2024 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall short in evaluating the image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with texts on the slides on the presentation recordings. As technical terminologies that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.


Summary

  • The paper presents SlideAVSR, a dataset that leverages visual slide cues to boost transcription accuracy for technical speech.
  • It details the data collection pipeline, which mines paper explanation videos from YouTube and filters them with tools such as ChatGPT and BLIP-2 for quality assurance.
  • The study also introduces DocWhisper, a Whisper-based ASR model augmented with OCR and an FQ Ranker, which significantly improves recognition of complex scientific terminology.

Introduction to SlideAVSR

In the field of artificial intelligence, multimodal models, which process different types of data such as language, images, video, and audio, are becoming increasingly capable. A growing subfield within this domain is audio-visual speech recognition (AVSR), which enhances automatic speech recognition (ASR) by supplementing audio with video. Traditional AVSR work focuses mainly on improving the recognition of spoken words through lip-reading. This paper takes a different approach by introducing the SlideAVSR dataset, which emphasizes audio-visual comprehension of speech in the context of scientific paper explanation videos.

Creating the SlideAVSR Dataset

The authors of the paper have created SlideAVSR to address the challenge of transcribing technical terminologies commonly found in academic presentations. These terms are difficult for standard ASR to recognize without additional textual references. The dataset was constructed by mining paper explanation videos from YouTube that include visual data in the form of presentation slides, which are often paired with spoken technical jargon. This creates a unique context for AVSR systems to incorporate visual cues from text to accurately recognize speech.

SlideAVSR fills a gap in existing AVSR datasets, which are predominantly centered on lip-reading. A methodical process was used to collect and refine the data, ensuring quality and relevance: videos had to meet criteria such as duration, format, and the availability of manual subtitles. Filtering relied on automatic tools, with ChatGPT used to analyze video descriptions and BLIP-2 used to confirm the presence of slides in frames sampled from the videos, as sketched below.
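
The exact filtering scripts are not included in this summary, but the BLIP-2 step can be approximated with an off-the-shelf checkpoint. Below is a minimal sketch assuming the Hugging Face `Salesforce/blip2-opt-2.7b` model and a hypothetical yes/no prompt; the authors' actual prompt and decision rule may differ.

```python
# Minimal sketch: use BLIP-2 visual question answering to check whether a
# sampled video frame shows a presentation slide. Model choice and prompt
# wording are assumptions, not the authors' exact filter.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def frame_shows_slide(frame_path: str) -> bool:
    """Return True if BLIP-2 answers that the frame contains a slide."""
    image = Image.open(frame_path).convert("RGB")
    prompt = "Question: Does this image show a presentation slide? Answer:"
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        "cuda", torch.float16
    )
    output_ids = model.generate(**inputs, max_new_tokens=5)
    answer = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
    return "yes" in answer.lower()

# Example: keep a video only if most sampled frames look like slides.
# frames = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]
# keep = sum(frame_shows_slide(f) for f in frames) > len(frames) // 2
```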

Enhancing AVSR with DocWhisper

To complement the introduction of SlideAVSR, the authors propose DocWhisper, a modified version of the existing ASR model Whisper. DocWhisper integrates optical character recognition (OCR) to utilize textual information from slides, helping it capture technical terminology that audio alone might miss; a rough approximation is sketched below.
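
One simple way to give Whisper access to slide text is to pass OCR output through its prompt mechanism. The sketch below is a rough approximation, not the authors' DocWhisper implementation: it assumes the open-source `openai-whisper` and `pytesseract` packages and a naive prompt built from raw OCR output. Because Whisper's prompt window is limited, choosing which OCR words to include matters, which is the role of the FQ Ranker described next.

```python
# Minimal sketch: extract slide text with OCR and feed it to Whisper as a
# prompt, so rare technical terms are available during decoding. Library
# choices (openai-whisper, pytesseract) and the prompt format are
# illustrative assumptions, not the authors' DocWhisper implementation.
import whisper
import pytesseract
from PIL import Image

def transcribe_with_slide_text(audio_path: str, slide_image_path: str) -> str:
    # OCR the slide image into raw text.
    slide_text = pytesseract.image_to_string(Image.open(slide_image_path))
    keywords = " ".join(slide_text.split())  # collapse whitespace

    # Whisper's initial_prompt biases decoding toward the supplied words.
    model = whisper.load_model("large-v2")
    result = model.transcribe(audio_path, initial_prompt=keywords)
    return result["text"]

# Example (hypothetical paths):
# print(transcribe_with_slide_text("talk_segment.wav", "slide_03.png"))
```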

The newly proposed FQ Ranker works alongside DocWhisper, sorting OCR results by word frequency and prioritizing rarer, presumably harder-to-recognize terms; a sketch follows below. Experiments show that DocWhisper, particularly when combined with FQ Ranker, significantly improves performance on the SlideAVSR dataset, including on speech with accents that have historically been challenging for ASR systems.
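
This summary does not name the frequency resource behind FQ Ranker, so the sketch below uses the `wordfreq` package as an assumed stand-in, ranking OCR tokens from rarest to most common so that the rarest terms are kept when only a limited number of words fit into the prompt.

```python
# Minimal sketch of a frequency-based ranker: keep the rarest OCR words,
# which are the ones an ASR model is least likely to get right from audio
# alone. The wordfreq package is an assumed stand-in for the paper's
# frequency data.
from wordfreq import zipf_frequency

def rank_ocr_words(ocr_text: str, top_k: int = 50) -> list[str]:
    """Return up to top_k unique OCR words, rarest first."""
    words = {w.strip(".,:;()[]") for w in ocr_text.split() if w.isascii()}
    words = {w for w in words if len(w) > 2}
    # zipf_frequency returns ~0 for unseen words and ~7 for very common ones,
    # so ascending order puts the rarest (most informative) words first.
    ranked = sorted(words, key=lambda w: zipf_frequency(w.lower(), "en"))
    return ranked[:top_k]

# Example: prioritize rare terms from a slide before building the prompt.
# rank_ocr_words("the softmax attention of Transformer and BLEU perplexity")
# -> rare terms like "BLEU", "softmax", "perplexity" come before "the", "and"
```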

Looking Forward

The findings underscore the potential for considerable improvements in AVSR systems when they effectively utilize visual context. This is a meaningful step forward for applications that involve complex linguistic content, such as academia or technical industries. Future research directions include refining OCR-based methods, constructing end-to-end AVSR models, and expanding benchmarks to assess AVSR models' image comprehension over a wider range of video types.

The creation of SlideAVSR marks an important contribution to the AVSR field, enabling more accurate models capable of handling speech in a broader context than simply lip-reading. Such advancements are expected to give rise to more sophisticated AVSR systems that will better understand and transcribe spoken language across varying settings.

Limitations and Ethical Considerations

It is noted that SlideAVSR currently includes a limited number of speakers and thus may not be representative of all demographics; the categorization of speakers' accents is also a work in progress with room for refinement. The authors affirm their commitment to ethical research, stating that they release only non-sensitive data and restrict its usage to research applications, thereby respecting individuals' privacy and adhering to YouTube's data policies.