ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering (1906.02467v1)

Published 6 Jun 2019 in cs.CV

Abstract: Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets are small in scale, automatically generated, and otherwise limited, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos. The dataset is available at https://github.com/MILVLG/activitynet-qa

Citations (330)

Summary

  • The paper introduces a large-scale, fully human-annotated dataset to overcome limitations in existing VideoQA resources.
  • It employs diverse untrimmed video samples and varied question types to assess spatio-temporal understanding via dynamic sampling strategies.
  • Baseline experiments reveal challenges in temporal reasoning and highlight the need for advanced methods in multimodal video analysis.

ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering

The paper "ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering" introduces ActivityNet-QA, a comprehensive dataset designed to advance research in video question answering (VideoQA). The authors address the limitations of pre-existing VideoQA datasets, which include small scale, automatically generated QA pairs, and short videos with limited activity diversity. ActivityNet-QA is positioned as a solution to these shortcomings, offering a large scale, fully human-annotated dataset with videos of considerable length and variety.

Dataset Overview

ActivityNet-QA is derived from ActivityNet, a dataset of untrimmed web videos comprising approximately 20,000 samples across 200 action classes. For ActivityNet-QA, the authors selected a subset of 5,800 videos and collected 58,000 question-answer pairs for them through crowdsourcing. This human annotation process yields higher-quality data than automatic generation. The QA pairs are designed to probe various aspects of video understanding, including motion, spatial relationships, temporal relationships, and open-ended queries, offering robust test cases for evaluating VideoQA models.
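To make the annotation format concrete, here is a minimal Python sketch for pairing the released question and answer files by question id. The file names (`train_q.json`, `train_a.json`) and field names (`question_id`, `question`, `answer`) are assumptions about the JSON layout in the repository, not details confirmed in the paper.

```python
import json

def load_qa_pairs(question_file, answer_file):
    """Join a question file and an answer file on question id.

    Assumes each file is a JSON list of dicts with a "question_id" key;
    these names are hypothetical and may differ in the actual release.
    """
    with open(question_file) as fq, open(answer_file) as fa:
        questions = {q["question_id"]: q for q in json.load(fq)}
        answers = {a["question_id"]: a for a in json.load(fa)}
    # Pair each question with its answer; skip ids missing on either side.
    return [
        {"question": questions[qid]["question"], "answer": answers[qid]["answer"]}
        for qid in questions.keys() & answers.keys()
    ]

if __name__ == "__main__":
    pairs = load_qa_pairs("train_q.json", "train_a.json")
    print(len(pairs), pairs[0] if pairs else None)
```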

Experimental Setup

The authors conducted extensive experiments using ActivityNet-QA to evaluate baseline VideoQA models, including an extension of VQA (Visual Question Answering), a memory network model, and a soft attention model. The testing covered different strategies for video feature representation, focusing on how video features are extracted and used for these complex VideoQA tasks. Performance was measured with accuracy and WUPS, a soft accuracy measure based on Wu-Palmer word similarity, providing a solid benchmark for future research.
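For readers unfamiliar with WUPS, the sketch below computes a thresholded Wu-Palmer similarity score using NLTK's WordNet interface. It simplifies the measure to single-word answers (multi-word answers would need the set-based min/product form of the original definition), and NLTK is an illustrative choice here rather than the paper's evaluation code.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def wup(word_a, word_b, threshold=0.9):
    """Thresholded Wu-Palmer similarity between two words.

    Takes the best similarity over all synset pairs; scores below the
    threshold are down-weighted by 0.1, as in the standard WUPS definition.
    """
    if word_a == word_b:
        return 1.0
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    best = max(
        (s1.wup_similarity(s2) or 0.0 for s1, s2 in product(syns_a, syns_b)),
        default=0.0,
    )
    return best if best >= threshold else 0.1 * best

def wups(predictions, ground_truths, threshold=0.9):
    """Corpus-level WUPS@threshold, simplified to single-word answers."""
    scores = [wup(p, g, threshold) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)

# Toy example: WUPS@0.9 over two predictions.
print(wups(["dog", "running"], ["puppy", "jumping"]))
```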

Results and Insights

The baseline results demonstrate the challenges inherent in the ActivityNet-QA dataset, particularly on questions involving temporal reasoning, where existing models struggle. Despite the difficulty, the paper shows that models using dynamic sampling strategies outperform those with fixed sampling, indicating that identifying the most informative moments or frames in a video is key to improving VideoQA performance. The results underscore the need for more advanced methods that can effectively capture spatio-temporal dependencies within videos.
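The contrast between fixed and content-aware sampling can be illustrated with a short sketch. Uniform sampling picks frames at evenly spaced indices, while the "dynamic" variant below selects the frames that differ most from their predecessor; that frame-difference heuristic is purely illustrative and is not claimed to be the sampling strategy used in the paper.

```python
import numpy as np

def fixed_sampling(frames, k):
    """Uniformly pick k frames regardless of content."""
    idx = np.linspace(0, len(frames) - 1, k).astype(int)
    return [frames[i] for i in idx]

def dynamic_sampling(frames, k):
    """Pick the k frames with the largest change from their predecessor.

    Illustrative content-aware heuristic only, not the paper's method.
    """
    diffs = [np.abs(frames[i] - frames[i - 1]).mean() for i in range(1, len(frames))]
    idx = sorted(np.argsort(diffs)[-k:] + 1)  # +1: diffs[j] describes frame j+1
    return [frames[i] for i in idx]

# Toy example: 100 "frames" of 8x8 grayscale noise, sample 5 from each strategy.
frames = [np.random.rand(8, 8) for _ in range(100)]
print(len(fixed_sampling(frames, 5)), len(dynamic_sampling(frames, 5)))
```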

Implications and Future Directions

The introduction of ActivityNet-QA marks a significant step toward developing more sophisticated VideoQA systems. This dataset has the potential to guide the development of models that can handle real-world complexity in video content, pushing the boundaries of multimodal learning that combines vision and language. As future work, the dataset's bilingual QA pairs may encourage multilingual VideoQA endeavors, broadening the scope of research to consider cross-lingual and cultural nuances in video understanding. Additionally, integrating auxiliary information such as dense video captions could further enhance the ability of models to comprehend intricate video narratives.

ActivityNet-QA sets a new standard for VideoQA datasets, providing a comprehensive resource that emphasizes human annotation and diverse, complex video scenarios. This dataset presents the research community with numerous opportunities to develop and refine models capable of understanding and reasoning about the richly detailed content of web videos.
