
Open-Vocabulary Video Relation Extraction (2312.15670v1)

Published 25 Dec 2023 in cs.CV

Abstract: A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in a superficial understanding of the action. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on pairwise relations that take part in the action and describes these relation triplets with natural languages. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a crossmodal mapping model to generate relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE.

References (50)
  1. REBEL: Relation extraction by end-to-end language generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 2370–2381.
  2. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv preprint arXiv:2304.08345.
  3. Semantic proposal for activity localization in videos via sentence query. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 8199–8206.
  4. Deep Learning for Video Captioning: A Review. In IJCAI, volume 1, 2.
  5. RandAugment: Practical data augmentation with no separate search. CoRR, abs/1909.13719.
  6. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  7. Video relation detection via tracklet based visual transformer. In Proceedings of the 29th ACM international conference on multimedia, 4833–4837.
  8. Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  9. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  10. Detecting and recognizing human-object interactions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 8359–8367.
  11. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In 2017 IEEE International Conference on Computer Vision (ICCV), 5843–5851.
  12. Seq-NMS for Video Object Detection. arXiv:1602.08465.
  13. ActivityNet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 961–970.
  14. Action Genome: Actions as Composition of Spatio-temporal Scene Graphs. CoRR, abs/1912.06992.
  15. The Kinetics Human Action Video Dataset. ArXiv, abs/1705.06950.
  16. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5): 1366–1401.
  17. Hake: a knowledge engine foundation for human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  18. Knowledge-guided semantic transfer network for few-shot image recognition. IEEE Transactions on Neural Networks and Learning Systems.
  19. Deep collaborative embedding for social image understanding. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2070–2083.
  20. Swinbert: End-to-end transformers with sparse attention for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 17949–17958.
  21. FineAction: A Fine-Grained Video Dataset for Temporal Action Localization. arXiv:2105.11107.
  22. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508: 293–304.
  23. Clipcap: Clip prefix for image captioning. arXiv preprint arXiv:2111.09734.
  24. Moments in Time Dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–8.
  25. Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
  26. Expanding Language-Image Pretrained Models for General Video Recognition. arXiv:2208.02816.
  27. Locate before answering: Answer guided question localization for video question answering. IEEE Transactions on Multimedia.
  28. Video Relation Detection with Spatio-Temporal Graph. In Proceedings of the 27th ACM International Conference on Multimedia, MM ’19, 84–93. New York, NY, USA: Association for Computing Machinery. ISBN 9781450368896.
  29. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
  30. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
  31. Finetuned CLIP models are efficient video learners. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  32. Visual semantic role labeling for video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5589–5600.
  33. Annotating Objects and Relations in User-Generated Videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, 279–287. ACM.
  34. Video Visual Relation Detection. Proceedings of the 25th ACM international conference on Multimedia.
  35. Relation Triplet Construction for Cross-modal Text-to-Video Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, 4759–4767.
  36. Spatial-temporal graphs for cross-modal text2video retrieval. IEEE Transactions on Multimedia, 24: 2914–2923.
  37. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, 4858–4862.
  38. GIT: A Generative Image-to-text Transformer for Vision and Language. arXiv:2205.14100.
  39. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4581–4591.
  40. Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, 1459–1468.
  41. A Survey on Temporal Action Localization. IEEE Access, 8: 70477–70487.
  42. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. arXiv:2302.00402.
  43. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5288–5296.
  44. VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. arXiv:2212.04979.
  45. Commonsense justification for action explanation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2627–2637.
  46. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5534–5542.
  47. Reconstruct and represent video contents for captioning via reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 42(12): 3088–3101.
  48. Videolt: Large-scale long-tailed video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7960–7969.
  49. VRDFormer: End-to-End Video Visual Relation Detection with Transformers. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18814–18824.
  50. RegionCLIP: Region-based Language-Image Pretraining. arXiv:2112.09106.

Summary

  • The paper introduces OVRE, a novel task that extracts detailed action-centric relation triplets to capture nuanced actor-action relationships.
  • It leverages a cross-modal mapping model that uses the CLIP visual encoder and a pre-trained LLM to translate video semantics into natural language.
  • The Moments-OVRE dataset, comprising 180,000 videos with unrestricted-vocabulary annotations, provides large-scale training and evaluation data for open-vocabulary video relation extraction.

The paper "Open-Vocabulary Video Relation Extraction" introduces a novel task designed to enhance video understanding by focusing on action-centric relation triplets. This task, termed Open-vocabulary Video Relation Extraction (OVRE), seeks to transcend traditional action classification methods that often overlook the nuanced actors and relationships involved in video actions. Instead, OVRE emphasizes the extraction of pairwise relations in videos and describes these relation triplets using natural language.

Key elements of the paper include the introduction of the Moments-OVRE dataset, a vast collection consisting of 180,000 videos annotated with action-centric relation triplets. This dataset is derived from the Multi-Moments in Time (M-MiT) dataset, which is known for its multi-label nature and brief video durations, typically around three seconds.
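
As a rough illustration of what a Moments-OVRE annotation might contain, the record below pairs an M-MiT clip with its action labels and open-vocabulary triplets. The field names and values are hypothetical and do not reflect the released schema.

```python
# Illustrative only: field names and values are assumptions, not the released schema.
example_record = {
    "video_id": "mmit_clip_0001",              # clip inherited from Multi-Moments in Time
    "duration_sec": 3.0,                       # M-MiT clips run roughly three seconds
    "action_labels": ["cooking", "chopping"],  # multi-label actions from the source dataset
    "relation_triplets": [                     # open-vocabulary, action-centric annotations
        ["chef", "chopping", "vegetables"],
        ["chef", "holding", "knife"],
    ],
}
```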

The authors propose a cross-modal mapping model that uses the CLIP visual encoder to encode video semantics, which a pre-trained language model then translates into natural-language relation triplets. This approach allows generation with an unconstrained vocabulary, moving beyond the fixed label sets typical of related tasks such as Video Visual Relation Detection (VidVRD) and Action Genome, which restrict objects and predicates to limited categories and therefore often fail to capture the full complexity of actions and their contexts.
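
This description matches a ClipCap-style prefix design: frozen CLIP frame features are mapped into the language model's embedding space and prepended as a soft prefix for autoregressive decoding of the triplet sequence. The sketch below follows that pattern; the checkpoint names, mean pooling, prefix length, and mapping network are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class VideoToTripletModel(nn.Module):
    """Hedged sketch: CLIP frame encoder -> mapping network -> GPT-2 prefix decoding."""

    def __init__(self, prefix_len: int = 8):
        super().__init__()
        self.visual = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        self.prefix_len = prefix_len
        d_clip = self.visual.config.hidden_size  # 768 for ViT-B/32
        d_lm = self.lm.config.n_embd             # 768 for GPT-2
        # Mapping network that turns pooled video features into a soft prefix.
        self.mapper = nn.Sequential(nn.Linear(d_clip, d_lm * prefix_len), nn.Tanh())

    def forward(self, frames: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, 3, 224, 224); target_ids: (1, seq_len) tokenized triplet sequence.
        with torch.no_grad():  # keep the visual encoder frozen
            feats = self.visual(pixel_values=frames).pooler_output  # (num_frames, d_clip)
        video_feat = feats.mean(dim=0, keepdim=True)                # mean-pool over frames
        prefix = self.mapper(video_feat).view(1, self.prefix_len, -1)
        token_embeds = self.lm.transformer.wte(target_ids)
        inputs = torch.cat([prefix, token_embeds], dim=1)
        # Mask the prefix positions out of the loss with -100 labels.
        labels = torch.cat(
            [torch.full((1, self.prefix_len), -100, dtype=torch.long), target_ids], dim=1
        )
        return self.lm(inputs_embeds=inputs, labels=labels).loss
```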

In benchmarking for OVRE, existing cross-modal generation models are evaluated, establishing baselines for further research on video relation extraction (a simple scoring sketch is given after the list below). By leveraging the expansive vocabulary of pre-trained language models, the authors aim to capture more dynamic and nuanced actions within videos, offering a deeper understanding of video content. The Moments-OVRE dataset is also positioned as the most extensive video relation extraction dataset, featuring:

  1. Unrestricted Vocabulary Annotations: Annotations encompass diverse actors and relations, providing a more accurate representation of real-world scenarios.
  2. Focus on Action-Centric Annotations: The emphasis is on annotations relevant to video actions/events.
  3. Scale: Its 180,000 videos make it a substantial addition to existing video relation extraction datasets.
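
For the benchmarking mentioned above, generated sequences must be parsed back into triplets and compared against the ground-truth annotations. The snippet below shows a simple exact-match precision/recall/F1 computation; this is an illustrative baseline metric, not necessarily the paper's evaluation protocol, which in an open-vocabulary setting would more plausibly rely on softer, similarity-based matching.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]

def precision_recall_f1(pred: List[Triplet], gold: List[Triplet]):
    """Exact-match triplet scoring; an illustrative baseline, not the paper's metric."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

pred = [("chef", "chopping", "vegetables"), ("chef", "wearing", "apron")]
gold = [("chef", "chopping", "vegetables"), ("chef", "holding", "knife")]
print(precision_recall_f1(pred, gold))  # (0.5, 0.5, 0.5)
```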

In conclusion, the paper presents OVRE as a step forward in video understanding, bridging general action classification and detailed linguistic description through context-level comprehension of video content. The proposed dataset and task formulation provide a richer framework for modeling human-like comprehension of video scenes and are poised to advance automatic video understanding.