An Analysis of the AVA Dataset for Spatio-temporally Localized Atomic Visual Actions
The paper "AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions" presents a comprehensive video dataset aimed at advancing action recognition research. Authored by a team of researchers from Google Research, Inria, and UC Berkeley, the paper meticulously details the creation, characteristics, and potential impact of the AVA dataset. This essay will provide a structured overview of the dataset’s significant contributions, the novel approach for action localization, and implications for future AI developments.
Dataset Characteristics
The AVA dataset introduces a robust and nuanced collection of video clips annotated with a focus on spatio-temporally localized atomic actions. Here are the defining properties of AVA:
- Rich Annotations: The dataset comprises 430 15-minute video clips, densely annotated with 80 atomic actions, yielding 1.58 million action labels.
- Person-Centric Annotation: Keyframes are sampled at 1 Hz; in each keyframe, every person is localized with a bounding box and labeled with all actions they are performing (see the annotation sketch after this list).
- Exhaustive Labeling: Unlike earlier datasets that sparsely annotate composite actions in brief clips, AVA labels every person in every keyframe over long video segments, yielding a more faithful representation of realistic scene and action complexity.
- Temporal Context: Annotators view a short segment extending 1.5 seconds on either side of each keyframe, so temporal cues can be used to disambiguate visually similar actions.
- Diverse Action Vocabulary: The action classes are deliberately atomic and fine-grained, allowing annotators to distinguish closely related actions such as "touch" versus "hold."
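To make the per-keyframe, per-person structure concrete, the sketch below shows one plausible way to represent and parse such annotations in Python. The CSV layout and field names are assumptions for illustration, not the authoritative schema of the released AVA files.

```python
# Illustrative representation of AVA-style annotations: one row per
# (keyframe, person, action) triple. The column order and names below are
# assumptions, not the official release format.
import csv
from dataclasses import dataclass

@dataclass
class PersonActionLabel:
    video_id: str      # identifier of the 15-minute source clip
    timestamp: int     # keyframe time in seconds (keyframes sampled at 1 Hz)
    x1: float          # bounding box corners, normalized to [0, 1]
    y1: float
    x2: float
    y2: float
    action_id: int     # one of the 80 atomic action classes
    person_id: int     # links the same person across consecutive keyframes

def load_labels(path):
    """Parse a CSV of person boxes and their atomic action labels."""
    labels = []
    with open(path, newline="") as f:
        for vid, ts, x1, y1, x2, y2, act, pid in csv.reader(f):
            labels.append(PersonActionLabel(
                vid, int(float(ts)),
                float(x1), float(y1), float(x2), float(y2),
                int(act), int(pid)))
    return labels
```

Because a person commonly performs several atomic actions at once, multiple rows can share the same box and person identifier while differing only in action_id.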
Methodology for Data Collection and Annotation
The data collection and annotation process for AVA is multi-faceted, involving initial automated detection and subsequent human verification:
- Action Vocabulary: A list of atomic actions, spanning person poses and movements, person-object interactions, and person-person interactions, was devised to remain generic across varied environments while covering the behaviors that commonly occur in video.
- Movie Selection: Clips are drawn from movies produced around the world, selected to maximize diversity and avoid biasing the data toward specific genres or contexts.
- Bounding Box Annotation: Automated person detections are combined with manual corrections and additions, ensuring that individuals are localized and tracked with high fidelity across keyframes.
- Action Annotation: A two-stage propose-and-verify procedure, in which annotators first propose candidate action labels and a second round of annotators verifies them, improves recall, especially for actions with sparser examples (see the sketch after this list).
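The sketch below illustrates the general shape of such a two-stage procedure: a proposal stage that unions candidate labels for high recall, and a verification stage that keeps only labels confirmed by enough independent votes. The function name and the threshold are hypothetical stand-ins, not the paper's actual crowdsourcing tooling.

```python
# Schematic two-stage propose-and-verify flow for labeling one person box.
# The helper name and the min_yes threshold are hypothetical; the paper's
# crowdsourcing pipeline and parameters may differ.
def propose_and_verify(proposals_per_annotator, verification_votes, min_yes=2):
    """Stage 1: union candidate labels from several annotators (high recall).
    Stage 2: keep only candidates confirmed by >= min_yes verifiers (precision)."""
    candidates = set()
    for labels in proposals_per_annotator:
        candidates.update(labels)

    verified = []
    for label in sorted(candidates):
        yes_votes = sum(1 for vote in verification_votes.get(label, []) if vote)
        if yes_votes >= min_yes:
            verified.append(label)
    return verified

# Example: two annotators propose labels for one box; three verifiers vote on each.
proposals = [["stand", "talk to"], ["stand", "watch"]]
votes = {"stand": [True, True, True],
         "talk to": [True, True, False],
         "watch": [False, False, True]}
print(propose_and_verify(proposals, votes))  # ['stand', 'talk to']
```

Separating proposal from verification lets the first stage stay permissive (missing a label is costly) while the second stage filters out spurious proposals.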
Benchmarking and Comparative Analysis
The dataset's difficulty is demonstrated through benchmarking experiments against existing action detection datasets such as JHMDB and UCF101-24. While state-of-the-art methods achieve high performance on these traditional benchmarks, the same approach reaches a frame-level mean Average Precision (mAP) of only 15.6% on AVA. The gap underscores the intrinsic difficulty of atomic action recognition: models must interpret fine-grained spatio-temporal cues in realistic, multi-person scenes. The sketch below illustrates the frame-level evaluation protocol behind this figure.
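Frame-level evaluation ranks per-class detections by score and counts a detection as correct when it overlaps an unmatched ground-truth box of that class at IoU >= 0.5. The code below is an illustrative re-implementation of that idea, not the official AVA evaluation script, and it pools boxes for brevity rather than matching within each frame separately.

```python
# Illustrative frame-level AP at IoU 0.5 for one action class; mAP averages
# this value over the 80 action classes. Not the official evaluation code.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def average_precision(detections, ground_truth, thresh=0.5):
    """detections: list of (score, box); ground_truth: list of boxes."""
    if not ground_truth:
        return 0.0
    detections = sorted(detections, key=lambda d: -d[0])
    matched = [False] * len(ground_truth)
    ap, tp, fp, prev_recall = 0.0, 0, 0, 0.0
    for score, box in detections:
        # Greedily match the highest-overlap unmatched ground-truth box.
        best, best_iou = -1, thresh
        for i, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if not matched[i] and overlap >= best_iou:
                best, best_iou = i, overlap
        if best >= 0:
            matched[best] = True
            tp += 1
        else:
            fp += 1
        recall = tp / len(ground_truth)
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)  # precision * recall increment
        prev_recall = recall
    return ap
```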
Implications and Future Directions
The introduction of the AVA dataset holds substantial implications for both practical applications and theoretical advancements in AI:
- Enhanced Action Recognition Models: The low baseline performance on AVA indicates significant room for improvement in existing models. Researchers must focus on developing algorithms capable of parsing the subtle nuances of atomic actions and integrating richer temporal context.
- Broader Application Scope: With its exhaustive and nuanced annotations, AVA facilitates the training of models that could be applied in various domains, including surveillance, autonomous driving, and human-computer interaction.
- Richer Temporal Models: Future research should explore stronger temporal modeling, for instance recurrent networks or transformers over per-person features, to better capture how actions evolve over time (a minimal sketch follows this list).
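As one illustration of that direction, the sketch below applies a small transformer encoder over per-keyframe features of a person track and emits per-keyframe action logits. The module name, feature dimension, and hyperparameters are assumptions for illustration; this is not the baseline architecture used in the paper.

```python
# Minimal sketch of temporal context modeling over a person track, assuming a
# separate (not shown) feature extractor that yields one vector per person per
# keyframe. Hyperparameters are illustrative, not the paper's baseline.
import torch
import torch.nn as nn

class TemporalActionHead(nn.Module):
    def __init__(self, feat_dim=1024, num_actions=80, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, track_feats):
        """track_feats: (batch, time, feat_dim) features for one person track.
        Returns per-keyframe multi-label action logits of shape (batch, time, num_actions)."""
        context = self.encoder(track_feats)   # exchange information across time
        return self.classifier(context)       # one logit per atomic action

# Example: a 3-second context sampled at 1 Hz gives 3 timesteps per track.
feats = torch.randn(4, 3, 1024)
logits = TemporalActionHead()(feats)          # shape: (4, 3, 80)
```

Because several atomic actions can co-occur for one person, a multi-label (sigmoid) loss over the logits is the natural training objective for such a head.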
The AVA dataset represents a significant contribution to the field of action recognition, providing a rich resource that pushes the boundaries of current methodologies. It invites the research community to address the complexities of fine-grained action understanding, fostering advancements that will be crucial in developing AI systems with a more profound understanding of human activity and behavior.