Analysis of "HowToCaption: Prompting LLMs to Transform Video Annotations at Scale"
The research paper "HowToCaption: Prompting LLMs to Transform Video Annotations at Scale" addresses the challenge of improving the quality of textual annotations for instructional videos in order to train robust multimodal representations. Automatic speech recognition (ASR) subtitles are the prevalent form of supervision in current large-scale datasets such as HowTo100M, but they are noisy, unstructured, and frequently misaligned with the corresponding visual content. The paper proposes HowToCaption, a framework that leverages large language models (LLMs) to turn ASR subtitles into structured, semantically rich video captions, thereby providing a more effective supervisory signal for text-video models.
Methodology
The authors prompt an LLM with extended blocks of ASR subtitles rather than individual sentences, using customized prompts that let the model draw on context beyond a single sentence and generate plausible, coherent video descriptions. The LLM is also tasked with estimating a timestamp for each generated caption so that it can be placed on the video timeline. Post-processing then filters and temporally realigns the captions using a pre-trained text-video model, so that only high-quality caption-video pairs are retained.
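To make this prompt-then-realign pipeline concrete, below is a minimal sketch in Python. It assumes an LLM exposed as a simple prompt-to-text callable and a CLIP-like text-video scoring function; the names (`build_prompt`, `clip_score`), the output format, the search window, and the threshold are illustrative assumptions, not the authors' actual prompts, models, or hyperparameters.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Caption:
    text: str
    start: float  # seconds; initially estimated by the LLM
    end: float


def build_prompt(asr_block: List[Tuple[float, float, str]]) -> str:
    """Format a block of timestamped ASR subtitles into one prompt that asks
    the LLM to rewrite them as visual captions, each with a time span."""
    lines = [f"[{start:.0f}-{end:.0f}] {text}" for start, end, text in asr_block]
    return (
        "Rewrite these instructional-video subtitles as short captions that "
        "describe what is visible on screen, one '[start-end] caption' per line:\n"
        + "\n".join(lines)
    )


def generate_captions(
    asr_block: List[Tuple[float, float, str]],
    llm: Callable[[str], str],  # hypothetical LLM interface: prompt -> completion
) -> List[Caption]:
    """Prompt the LLM and parse its '[start-end] caption' output lines."""
    captions = []
    for line in llm(build_prompt(asr_block)).splitlines():
        if not line.startswith("["):
            continue  # skip anything that is not a timestamped caption
        span, text = line.split("]", 1)
        start, end = (float(t) for t in span.lstrip("[").split("-"))
        captions.append(Caption(text.strip(), start, end))
    return captions


def filter_and_realign(
    captions: List[Caption],
    clip_score: Callable[[str, float, float], float],  # text vs. video-clip score
    max_shift: float = 10.0,  # seconds to search around the LLM's estimate
    step: float = 1.0,
    threshold: float = 0.3,
) -> List[Caption]:
    """Shift each caption within +/- max_shift seconds to the offset with the
    highest text-video similarity; drop captions that never reach `threshold`."""
    kept = []
    for cap in captions:
        offsets = [-max_shift + i * step for i in range(int(2 * max_shift / step) + 1)]
        score, shift = max(
            (clip_score(cap.text, cap.start + o, cap.end + o), o) for o in offsets
        )
        if score >= threshold:
            kept.append(Caption(cap.text, cap.start + shift, cap.end + shift))
    return kept
```

In practice the scoring callable would be backed by a pre-trained text-video model, and the search window and threshold would be tuned on held-out data rather than fixed as here.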
Experimental Results
The paper evaluates the generated captions through a new dataset, also named HowToCaption, derived from HowTo100M. Models pre-trained on HowToCaption show marked improvements in zero-shot text-to-video retrieval across established benchmarks, including YouCook2 and MSR-VTT. Notably, because the generated captions are no longer verbatim transcriptions, the text modality is decoupled from the audio track, which enables robust models over text, video, and audio to be trained without the additional regularization techniques that ASR-based supervision otherwise requires.
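As a reference point for how such zero-shot retrieval numbers are typically reported, the sketch below computes Recall@K from a text-video similarity matrix. This is a generic illustration of the metric under the usual convention that the matching video sits on the diagonal, not the paper's evaluation code.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int = 5) -> float:
    """Recall@K for text-to-video retrieval: similarity[i, j] is the score of
    text query i against video j, with the ground-truth video at index i."""
    # Rank videos for each text query (higher score = better match).
    ranks = (-similarity).argsort(axis=1)
    # A query counts as a hit if its ground-truth video appears in the top k.
    hits = [i in ranks[i, :k] for i in range(similarity.shape[0])]
    return float(np.mean(hits))


# Toy 4x4 example: three of the four queries rank their own video first.
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.2, 0.8, 0.1, 0.3],
    [0.4, 0.2, 0.3, 0.1],  # query 2 ranks video 0 above its own video
    [0.0, 0.1, 0.2, 0.7],
])
print(recall_at_k(sim, k=1))  # 0.75
```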
Implications and Future Work
Practically, the HowToCaption methodology facilitates the creation of large-scale, annotated video datasets with significantly reduced human intervention, which is crucial for developing advanced multimedia applications. The refined captions enhance the training of models in tasks such as video classification, retrieval, and captioning by providing contextually appropriate and temporally aligned textual descriptions.
Theoretically, this work exemplifies the potential of LLMs to transform noisy, low-quality supervisory signals into structured, high-quality training data. It opens avenues for applying LLM-based text transformation to other forms of noisy supervision and to a broader range of AI challenges. Future developments could integrate this approach with additional modalities or optimize LLM prompting strategies to handle even larger datasets across different video genres and languages.
In conclusion, the HowToCaption framework represents a significant step towards enhancing multimodal learning by addressing the intrinsic quality issues of existing large-scale video datasets. Its application not only improves downstream task performance but also expands the utility of LLMs in video understanding tasks, promising wide-reaching impacts in the AI research and development landscape.