- The paper presents a story-based retrieval framework that condenses movies into key scenes and matches them to text using contextual and character embeddings.
- It incorporates context from adjacent scenes to improve text-to-video retrieval accuracy, outperforming baselines that treat each clip in isolation.
- The accompanying Condensed Movies Dataset (CMD) covers roughly 44% of a typical movie's plot in only about 15% of its running time, enabling scalable research into long-range narrative understanding.
Condensed Movies: Story Based Retrieval with Contextual Embeddings
The paper "Condensed Movies: Story Based Retrieval with Contextual Embeddings" by Bain et al. presents a novel approach to understanding the narrative structure of movies by focusing on "key scenes" rather than the entire film. This paper introduces the Condensed Movies Dataset (CMD), paving the way for advanced methodologies in text-to-video retrieval.
Contributions and Dataset
The authors propose a framework for story-based retrieval built on the condensed movie representations in the CMD. The dataset comprises over 3,000 films, each reduced to key scenes accompanied by high-level semantic descriptions, character face-tracks, and movie metadata. It exceeds existing movie description datasets by an order of magnitude in the number of movies, with data automatically and scalably curated from YouTube. Coverage is substantial: the key scenes capture approximately 44% of a typical movie's plot while accounting for only about 15% of its duration. Such a dataset is pivotal for developing and evaluating methods for long-range narrative understanding.
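To make the dataset's structure concrete, here is a minimal sketch of what a single CMD entry could look like in code; the field names, types, and values are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical sketch of one CMD entry; the schema is assumed for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CMDClip:
    movie_id: str                  # identifier of the source film
    clip_url: str                  # YouTube URL of the condensed key scene
    description: str               # high-level semantic description of the scene
    face_tracks: List[str] = field(default_factory=list)  # character face-track IDs
    clip_index: int = 0            # position of the scene in the movie's clip ordering

sample = CMDClip(
    movie_id="tt0000000",
    clip_url="https://www.youtube.com/watch?v=EXAMPLE",
    description="The detective confronts the suspect on the rooftop.",
    face_tracks=["char_01", "char_02"],
    clip_index=3,
)
```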
Methodology
- Text-to-Video Retrieval: The paper establishes a baseline model for text-to-video retrieval by learning video embeddings that integrate character, speech, and visual cues. A deep network combines these cues into a single joint embedding, improving retrieval over single-cue models.
- Contextual Embeddings: Context from the preceding and succeeding video clips is central to the proposed methodology. Adding this context enriches the representation of each individual clip, yielding retrieval that aligns more closely with the narrative progression of the movie (a simplified fusion of these ideas is sketched after this list).
- Character Module: A character embedding module lets the system reason about character identities in both clip descriptions and videos. This is particularly beneficial for within-movie retrieval, where character continuity is central to narrative comprehension.
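The PyTorch sketch below combines the ideas above: per-modality expert projections fused with learned gates, plus a learned blend of the neighboring clips' embeddings as narrative context. It is a minimal illustration under assumed feature dimensions and a simplified gating scheme, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualVideoEmbedder(nn.Module):
    """Fuse per-modality clip features into one embedding, then blend in
    the embeddings of neighboring clips as narrative context. Feature
    dimensions and the gating scheme are illustrative assumptions."""

    def __init__(self, dims=None, d=256):
        super().__init__()
        dims = dims or {"visual": 2048, "speech": 768, "character": 512}
        self.experts = nn.ModuleDict({m: nn.Linear(k, d) for m, k in dims.items()})
        # one learned scalar gate per modality, normalized with a softmax
        self.gates = nn.ParameterDict({m: nn.Parameter(torch.zeros(1)) for m in dims})
        self.mix = nn.Parameter(torch.tensor(0.0))  # clip-vs-context balance

    def embed_clip(self, feats):
        # feats: dict mapping modality name -> (batch, dim) feature tensor
        parts = [F.normalize(self.experts[m](x), dim=-1) for m, x in feats.items()]
        w = torch.softmax(torch.cat([self.gates[m] for m in feats]), dim=0)
        return sum(wi * p for wi, p in zip(w, parts))

    def forward(self, clip_feats, prev_feats=None, next_feats=None):
        emb = self.embed_clip(clip_feats)
        ctx = [self.embed_clip(f) for f in (prev_feats, next_feats) if f is not None]
        if ctx:
            a = torch.sigmoid(self.mix)  # how much to trust the clip itself
            emb = a * emb + (1 - a) * torch.stack(ctx).mean(dim=0)
        return F.normalize(emb, dim=-1)
```

A text encoder producing embeddings in the same d-dimensional space would then be trained jointly with this module under a contrastive ranking loss, so that matching description-clip pairs score higher than mismatched ones.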
Results and Implications
The paper evaluates the proposed retrieval models on the CMD, demonstrating that contextual embeddings substantially boost retrieval accuracy: the best-performing models improve recall in both cross-movie and within-movie retrieval. This highlights the benefit of character identity modeling and of contextual signal from adjacent scenes in video understanding tasks.
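Recall@K, the fraction of queries whose ground-truth clip appears among the top K retrieved videos, is the metric behind these comparisons. A minimal NumPy sketch, assuming the standard convention that query i's ground-truth video is video i:

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K from a similarity matrix.
    sim[i, j] = similarity between text query i and video j; the
    ground-truth match for query i is assumed to be video i."""
    ranks = (-sim).argsort(axis=1)            # videos sorted best-first per query
    gt = np.arange(sim.shape[0])[:, None]
    pos = (ranks == gt).argmax(axis=1)        # 0-based rank of the ground-truth video
    return {k: float((pos < k).mean()) for k in ks}

# On random scores the metric sits near chance level:
scores = np.random.randn(100, 100)
print(recall_at_k(scores))  # roughly {1: 0.01, 5: 0.05, 10: 0.10}
```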
The introduction of CMD and the accompanying retrieval models have significant practical implications: they advance movie search and indexing, and could support intelligent systems for semantic video summarization and automatic video description for accessibility.
Future Outlook
This research suggests several avenues for future work. Plot summaries could be leveraged further for contextual modeling, and extending the framework to broader datasets would help generalize the approach across narrative forms. Advances in deep learning models that capture relational data over long temporal spans could also yield deeper insight into character development and plot dynamics.
In conclusion, Bain et al. present substantial advancements in computational movie understanding, underscoring the importance of contextual data in narrative parsing and retrieval tasks. The CMD stands as a critical resource for furthering research into the semantic analysis of video content, marking an essential step in bridging narrative storytelling with computational techniques.