Introducing CinePile: A New Benchmark for Long-Form Video Understanding
Background and Motivation
Understanding long-form videos is no easy task, primarily because it involves comprehending not just individual frames, but also the temporal progression and complex interactions within the scenes. Existing datasets often miss this mark by allowing models to achieve high performance through analysis of just a few frames. This is where CinePile comes into play, offering a dataset that brings authentic long-form video comprehension challenges to the forefront.
CinePile: What Makes It Different?
CinePile stands out in several key aspects:
- Dataset Size and Diversity:
  - Contains around 305,000 multiple-choice questions (MCQs) derived from 9,396 video clips (see the loading sketch after this list).
  - The questions span topics such as temporal comprehension, human-object interactions, and scene reasoning.
- Complexity and Difficulty:
  - The dataset emphasizes challenging questions that stress even the latest video-centric LLMs.
  - Human evaluators outperform top commercial models by approximately 26% and open-source models by a staggering 70%.
- Automatic Question Generation:
  - Utilizes a novel pipeline that generates questions with the help of advanced LLMs, ensuring high diversity and complexity.
  - Takes advantage of human-in-the-loop methods, drawing on audio descriptions, transcriptions, and detailed annotations to create well-rounded questions.
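To make the dataset's shape concrete, here is a minimal sketch of loading CinePile with the Hugging Face `datasets` library. The repository ID, split name, and column names are assumptions about the public release rather than a confirmed schema; check the dataset card before relying on them.

```python
# A minimal sketch of pulling CinePile from the Hugging Face Hub.
# The repo ID and column names below are assumptions, not confirmed schema.
from datasets import load_dataset

ds = load_dataset("tomg-group-umd/cinepile", split="test")  # assumed ID/split

sample = ds[0]
print(sample["question"])    # assumed column: the MCQ stem
print(sample["choices"])     # assumed column: the five answer options
print(sample["answer_key"])  # assumed column: the correct option
```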
Creating CinePile
Data Collection
The dataset is sourced from publicly available movie clips, predominantly from YouTube's MovieClips channel, and is supplemented with audio descriptions from AudioVault and metadata from IMDb.
Automated Question Generation
Here's how the question generation process unfolds:
- Scene Localization:
  - Transcribe audio descriptions and align them with the video clips using tools like WhisperX for accurate contextual matching (see the alignment sketch after this list).
  - Extract the relevant segments from the audio descriptions to serve as context for the questions.
- Question Templates:
  - Start with 30,000 manually curated questions and use GPT-4 to generate templates from them.
  - Cluster these questions and refine the clusters into 86 unique templates, categorized into themes like Character and Relationship Dynamics, Narrative and Plot, Setting and Technical Analysis, and Thematic Exploration.
- Generation Pipeline:
  - Shortlist the most relevant templates for each scene.
  - Use LLMs to generate detailed MCQs from these templates, adding a rationale to ensure quality (see the generation sketch after this list).
  - Apply a filtering pass to remove trivial or poorly constructed questions.
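The scene-localization step leans on WhisperX's transcribe-then-align workflow. The sketch below follows the library's documented API; the file path and model size are illustrative assumptions, and matching the aligned transcript against video clip boundaries involves additional steps not shown here.

```python
# A minimal sketch of transcription + word-level alignment with WhisperX.
# The audio path and model size are placeholders, not CinePile's settings.
import whisperx

device = "cuda"
audio_file = "audio_description.mp3"  # assumed input: an AudioVault track

# 1. Transcribe with a Whisper backbone.
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align the transcript to obtain precise word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Each segment now carries start/end times usable for matching against a clip.
for seg in aligned["segments"]:
    print(f'{seg["start"]:7.2f}s - {seg["end"]:7.2f}s  {seg["text"]}')
```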
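And here is a hedged sketch of the template-driven generation step. The prompt wording, example template, and model name are illustrative assumptions, not the paper's exact prompts or pipeline.

```python
# An illustrative sketch of generating one MCQ from a scene + template.
# Prompt wording and template text are assumptions, not CinePile's own.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

template = (
    "Character and Relationship Dynamics: How does a character's "
    "behavior toward another character change over the scene?"
)
scene_context = "..."  # aligned dialogue + audio-description text for one clip

prompt = f"""Given the movie scene below, write one multiple-choice question
following this template, with five options (A-E), the correct answer,
and a short rationale explaining why the answer is correct.

Template: {template}

Scene:
{scene_context}"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```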
Quality Assurance and Human Study
Before finalizing, the dataset underwent a rigorous quality check:
- Conducted a human study in which 25 participants answered questions about randomly selected clips.
- Identified and resolved systemic issues by analyzing the questions that participants got wrong.
Performance and Model Evaluation
CinePile puts existing models to the test:
- Strong Results: The best commercial models achieve around 60% accuracy, yet lag behind average human performance (73%) and even further behind careful human annotators (86%).
- Model Trends: State-of-the-art models like GPT-4 Vision and Gemini Pro Vision lead the pack, but still show significant room for improvement.
- Open Source Performance: Open-source models like Video-ChatGPT and MovieChat show much lower performance, underscoring the need for robust training on comprehensive datasets like CinePile (a scoring sketch follows this list).
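To ground the accuracy numbers above, here is a minimal sketch of how MCQ accuracy can be computed from free-form model replies. The letter-extraction heuristic and field conventions are illustrative assumptions; the paper's exact answer-matching procedure may differ.

```python
# A minimal sketch of scoring MCQ answers, assuming replies contain a
# single option letter (A-E). The answer format is an assumption.
import re

def extract_choice(model_output: str) -> str:
    """Pull the first standalone option letter (A-E) out of a free-form reply."""
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else ""

def accuracy(predictions: list[str], answer_keys: list[str]) -> float:
    """Fraction of predictions whose extracted letter matches the key."""
    correct = sum(extract_choice(p) == k for p, k in zip(predictions, answer_keys))
    return correct / len(answer_keys)

# Toy usage: one right, one wrong -> 0.5
print(accuracy(["The answer is (B).", "C, because..."], ["B", "D"]))
```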
Implications and Future Directions
Practical Impact:
- CinePile provides a robust benchmark for evaluating and improving video understanding models.
- It highlights the gaps between human performance and model capabilities, focusing research efforts on these challenges.
Theoretical Impact:
- The dataset reinforces the importance of temporal understanding and multimodal reasoning in AI.
- It opens up opportunities for further research to bridge performance gaps, particularly in generating and synthesizing high-quality training data.
Concluding Thoughts
CinePile is a significant step towards tackling the complexities of long-form video understanding. By providing a diverse and challenging dataset, it not only sets a new benchmark but also paves the way for future advancements in video-centric AI models. As models get trained and fine-tuned on datasets like CinePile, the dream of machines genuinely understanding long, complex narratives gets a little closer to reality.
For those interested in exploring CinePile further, you can access the dataset and related artifacts here.