An Examination of Automatic Visual Question Answering Data Generation from Web Videos
The paper "Learning to Answer Visual Questions from Web Videos" addresses the challenges and limitations associated with large-scale data annotation for Video Question Answering (VideoQA) tasks by proposing a methodology to automatically generate data using readily available web video sources. Current methodologies for VideoQA heavily rely on manually annotated datasets which are often expensive, labor-intensive, and limited in size. This paper introduces an effective alternative by leveraging automatic cross-modal supervision to generate large-scale VideoQA datasets, offering a robust solution to scaling issues present in existing approaches.
Key Contributions
- Data Generation Framework: The authors automate the creation of VideoQA datasets from transcribed video narrations. A question-generation transformer trained on text-only data produces question-answer pairs from these transcriptions, yielding the HowToVQA69M dataset of 69 million video-question-answer triplets (a minimal generation sketch follows this list).
- Contrastive Training Methodology: To handle the open-ended nature of answers, the paper trains a video-question multi-modal transformer together with an answer transformer under a contrastive objective, which is key to coping with the diverse answer space of the generated dataset (a sketch of the loss follows this list).
- Zero-shot and Probe Evaluation: The paper introduces a zero-shot VideoQA task and a VideoQA feature probe evaluation setting to measure generalization without any manually annotated VideoQA supervision. The results demonstrate the efficacy of the proposed model, particularly on rare answers.
- Generalization to Additional Data Sources: The method's scalability and generality are demonstrated by applying it to a second source, web videos annotated with alt-text, which yields the WebVidVQA3M dataset and shows that the approach transfers across different kinds of text-video data.
- Introduction of the iVQA Benchmark: The paper also introduces the iVQA dataset, with reduced language bias and higher-quality manual annotations, offering a more reliable benchmark for evaluating VideoQA models.
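To make the generation step concrete, the following minimal sketch (Python, using the HuggingFace transformers library) shows how a transcribed narration sentence and an extracted answer span can be turned into a question with a T5-style seq2seq model. The checkpoint, prompt format, and helper function are illustrative assumptions, not the authors' exact pipeline; in practice the model would first be fine-tuned on text-only QA data for answer-conditioned question generation.

```python
# Illustrative sketch of transcript-to-QA generation, assuming a T5-style
# seq2seq model fine-tuned for answer-conditioned question generation.
# "t5-base" and the prompt format below are placeholders, not the paper's setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")  # fine-tune for QG before real use

def generate_question(narration: str, answer_span: str) -> str:
    """Turn one transcribed narration sentence plus an extracted answer span into a question."""
    prompt = f"generate question: answer: {answer_span} context: {narration}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: a speech-recognized sentence aligned with a video clip becomes a QA pair.
question = generate_question("Now I add two cups of flour to the bowl.", "two cups of flour")
print(question, "->", "two cups of flour")
```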
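The contrastive objective can likewise be sketched in a few lines: the joint video-question embedding is scored against answer embeddings, with the other answers in the batch serving as negatives. The tensor shapes, temperature, and InfoNCE-style formulation below are assumptions for illustration rather than the authors' exact loss.

```python
# A minimal sketch of a contrastive objective between video-question and answer
# embeddings, with in-batch answers as negatives. Shapes and the temperature
# value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of paired embeddings.

    vq_emb:  (B, D) outputs of the video-question multi-modal transformer
    ans_emb: (B, D) outputs of the answer transformer for the paired answers
    """
    vq_emb = F.normalize(vq_emb, dim=-1)
    ans_emb = F.normalize(ans_emb, dim=-1)
    logits = vq_emb @ ans_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(vq_emb.size(0), device=vq_emb.device)
    return F.cross_entropy(logits, targets)          # diagonal entries are the positive pairs
```

At inference time, zero-shot VideoQA can be framed with the same similarity score: rank a vocabulary of candidate answers against the video-question embedding and pick the highest-scoring one.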
Implications and Future Work
The implications of this research are significant for AI and multimedia understanding. By reducing reliance on manual annotation, the approach enables the creation of much larger and more diverse datasets, which in turn can yield more robust and generalizable VideoQA models. It also has the potential to benefit related tasks such as video retrieval, summarization, and human-computer interaction in multimedia contexts.
Further work could extend these methods to richer types of video content or combine them with additional modalities, such as audio taken directly from the videos, to enrich the input signal. Studying how this kind of scaling behaves on other datasets and adapting the generative pipeline to domain-specific applications are also promising directions.
Conclusions
The paper makes several valuable contributions to the VideoQA field: a scalable methodology for automated data generation, a contrastive training procedure suited to open-ended answers, and new evaluation settings and benchmarks. By generating vast and diverse datasets from web videos, it eases pressing challenges of training-data scale and annotation cost. The results show considerable promise and pave the way for broader exploration of automated visual understanding. Future investigations can build on these techniques and address the challenges that emerge in wider application scenarios.