
MovieQA: Understanding Stories in Movies through Question-Answering (1512.02902v2)

Published 9 Dec 2015 in cs.CV and cs.CL

Abstract: We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain.

MovieQA: Evaluating Story Comprehension in Movies through Question-Answering

The paper "MovieQA: Understanding Stories in Movies through Question-Answering" introduces the MovieQA dataset, which is designed to evaluate automatic story comprehension from both video and text sources. The dataset consists of 14,944 questions about 408 movies with high semantic diversity and provides multiple sources of information including plots, subtitles, video clips, scripts, and DVS transcriptions.

Dataset Composition

The MovieQA dataset is unique in its integration of various data sources:

  • Plot Synopses: Extended summaries obtained from Wikipedia.
  • Video and Subtitles: Extracted from movies, providing visual and dialog information.
  • DVS (Descriptive Video Service) Transcriptions: Narrations for the visually impaired, providing detailed scene descriptions.
  • Scripts: Collected from IMSDb, providing detailed scene information and dialogue.

The dataset is structured to support two primary evaluation tasks: text-based QA and video-based QA. For the text-based task, the story can take different textual forms (plots, subtitles, scripts, or DVS transcriptions). For the video-based task, the story comprises video clips, and optionally subtitles, allowing a comprehensive assessment of the QA models.
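For concreteness, the sketch below shows one way a multiple-choice QA item and the accuracy metric could be represented in code. The field names and dictionary keys are illustrative assumptions, not the dataset's actual schema or API.

```python
# Illustrative representation of a MovieQA-style 5-way multiple-choice item
# and its evaluation metric. Field names are hypothetical, not the dataset's schema.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MovieQAItem:
    question: str
    answers: List[str]                 # five candidate answers
    correct_index: int                 # index (0-4) of the correct answer
    sources: dict = field(default_factory=dict)  # e.g. {"plot": [...], "subtitles": [...]}


def accuracy(items: List[MovieQAItem], predict: Callable[[MovieQAItem], int]) -> float:
    """Fraction of items for which predict() returns the correct answer index."""
    correct = sum(1 for item in items if predict(item) == item.correct_index)
    return correct / len(items)
```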

Data Collection Method

To ensure high-quality questions and answers, the data collection process was divided into two steps. Initially, annotators generated questions and correct answers by referring to plot synopses. They also marked minimal sets of sentences within the plot that justified the questions and answers. In the second step, the multiple-choice answers were generated, including the correct one and four deceiving ones to challenge the QA systems.

Additionally, the dataset includes timestamp annotations for 6,462 QAs, indicating the location of the question and answer within the video. This allows the evaluation of models in a temporally coherent manner.

Intelligent Baselines and QA Methods

The paper explores various intelligent baselines and extends existing QA techniques to evaluate the complexity of the MovieQA dataset:

  1. Hasty Student: Attempts to answer questions without referring to the story. Several strategies were tested (a minimal sketch appears after this list):
    • Answer length bias
    • Similarity or distinctiveness among the candidate answers
    • Question-answer similarity

These story-agnostic strategies generally performed poorly compared to methods that use contextual information from the story.

  2. Searching Student: Attempts to locate the parts of the story relevant to a question, using cosine similarity or a convolutional neural network (the SSCB model) to match questions and answers against story segments. This approach achieved better results by leveraging text representations such as TF-IDF, Word2Vec, and SkipThoughts; a simplified TF-IDF version is sketched below.
  3. Memory Networks: Originally proposed for text QA, Memory Networks were modified by the authors to handle natural language answers and large vocabularies. This approach showed promise, particularly when using scripts, which contain rich contextual information from both descriptions and dialogues; a single-hop sketch follows below.
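As a concrete illustration of the hasty-student idea, the following minimal sketch picks the answer most similar to the question by word overlap, with no access to the story. The tokenization and overlap measure are simplifications for illustration, not the paper's exact strategies.

```python
# Minimal "hasty student" baseline: choose the answer with the highest word
# overlap with the question, ignoring the story entirely. Simplified sketch.
import re
from typing import List


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def hasty_student(question: str, answers: List[str]) -> int:
    """Return the index of the answer whose words overlap most with the question."""
    q_tokens = tokenize(question)

    def overlap(answer: str) -> float:
        a_tokens = tokenize(answer)
        return len(q_tokens & a_tokens) / (len(a_tokens) or 1)

    return max(range(len(answers)), key=lambda i: overlap(answers[i]))
```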
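The searching-student baseline can be illustrated with a simplified TF-IDF variant: each candidate answer is scored against a sliding window of story sentences, and the answer with the best-matching window wins. The windowing scheme and the additive scoring here are assumptions for illustration and are much simpler than the paper's SSCB network.

```python
# Simplified "searching student" with TF-IDF: score each answer by its
# best-matching window of story sentences (question similarity + answer similarity).
from typing import List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def searching_student(question: str, answers: List[str],
                      story_sentences: List[str], window: int = 3) -> int:
    vec = TfidfVectorizer().fit(story_sentences + [question] + answers)
    windows = [" ".join(story_sentences[i:i + window])
               for i in range(max(1, len(story_sentences) - window + 1))]
    W = vec.transform(windows)
    q = vec.transform([question])
    A = vec.transform(answers)
    q_sim = cosine_similarity(W, q)        # (num_windows, 1): window vs. question
    a_sim = cosine_similarity(W, A)        # (num_windows, num_answers): window vs. answers
    scores = (q_sim + a_sim).max(axis=0)   # best window score for each answer
    return int(np.argmax(scores))
```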
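The core attend-then-score step of a single-hop memory network can be sketched as follows, assuming the story sentences, the question, and the candidate answers have already been embedded as vectors (for example, mean word embeddings). The paper's specific modifications for natural language answers and large vocabularies are not reproduced here.

```python
# Single-hop memory-network scoring sketch over pre-computed embeddings.
# Illustrative only; omits the paper's weight-sharing and vocabulary handling.
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def memory_network_score(story: np.ndarray,     # (num_sentences, d) memory embeddings
                         question: np.ndarray,  # (d,) question embedding
                         answers: np.ndarray) -> int:  # (num_answers, d) answer embeddings
    attention = softmax(story @ question)  # attend to the most relevant story sentences
    context = attention @ story            # weighted sum of memories, shape (d,)
    query = question + context             # combine question with retrieved context
    scores = answers @ query               # score each candidate answer
    return int(np.argmax(scores))
```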

Results and Analysis

The evaluation reveals several insights:

  • Plot synopses yield the best performance among the text sources, largely because the questions and answers were authored from the plot synopses in the first place.
  • Multi-source information fusion improves performance, highlighting the importance of integrating diverse data sources.
  • Memory Networks and the CNN approach (SSCB) excel in handling the complexity of multi-choice QA, particularly for longer, more detailed text sources like scripts.

For video-based QA, the challenge is more complex, as evidenced by the lower performance when models rely solely on visual information. However, integrating subtitles with video improves results, demonstrating the critical role of dialog in comprehension.

Implications and Future Work

The MovieQA dataset sets a challenging benchmark for evaluating story comprehension in movies, pushing the boundaries of current QA models. The integration of multiple data sources provides a comprehensive testbed for models that aim to understand high-level semantics, motivations, and emotions portrayed in movies.

Future developments in AI could leverage this dataset to enhance applications such as assistive technologies for the visually impaired and cognitive robotics. By fostering advancements in deep learning models that integrate vision and language, MovieQA contributes towards a holistic understanding of multimedia content.

Conclusion

The MovieQA dataset represents a significant step in evaluating automatic story comprehension through question-answering. By integrating multiple sources of information and providing a robust benchmark, it challenges existing models and sets the stage for innovative research in the fields of computer vision, natural language processing, and machine learning.

Authors (6)
  1. Makarand Tapaswi (41 papers)
  2. Yukun Zhu (33 papers)
  3. Rainer Stiefelhagen (155 papers)
  4. Antonio Torralba (178 papers)
  5. Raquel Urtasun (161 papers)
  6. Sanja Fidler (184 papers)
Citations (699)