- The paper introduces a high-variation video dataset with 400 human action classes, each having at least 400 unique 10-second clips.
- It details a multi-step construction process including YouTube clip acquisition, human validation via AMT, and rigorous de-duplication.
- Benchmark tests on two-stream networks report top-1 and top-5 accuracies of 56.0% and 81.3%, highlighting the dataset's complexity.
The Kinetics Human Action Video Dataset: An Overview
The paper "The Kinetics Human Action Video Dataset" introduces a substantial effort to bridge the gap in datasets available for large-scale human action classification in videos. Authored by a team from Google DeepMind, the paper elucidates the creation and potential applications of what is now known as the Kinetics dataset.
Dataset Characteristics
The Kinetics dataset consists of 400 human action classes, with at least 400 video clips per class. Each clip lasts approximately 10 seconds and is taken from a different YouTube video, so no two clips share a source video. The actions span both human-object interactions (e.g., playing musical instruments) and human-human interactions (e.g., shaking hands).
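To make the dataset's structure concrete, the sketch below loads a clip-level annotation file and counts clips per class. It assumes a CSV with one row per clip and columns label, youtube_id, time_start, time_end; these column names are illustrative rather than the official release schema.

```python
import csv
from collections import Counter

def load_annotations(csv_path):
    """Load Kinetics-style clip annotations.

    Assumes one row per ~10-second clip with columns:
    label, youtube_id, time_start, time_end (illustrative schema).
    """
    clips = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            clips.append({
                "label": row["label"],
                "youtube_id": row["youtube_id"],
                "start": float(row["time_start"]),
                "end": float(row["time_end"]),
            })
    return clips

def clips_per_class(clips):
    """Count clips per action class, e.g. to check the >= 400-clip minimum."""
    return Counter(c["label"] for c in clips)
```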
Comparison to Existing Datasets
The constraints of earlier datasets such as HMDB-51 and UCF-101, which have become outdated due to their limited size and variation, are documented in detail. For instance, UCF-101 offers 101 action categories but draws its clips from only about 2.5k distinct videos, so many clips share the same source footage. In contrast, Kinetics achieves higher variation by taking each clip from a different source video.
Construction and Collection Methodology
The paper elaborates on the multi-step process followed to compile the dataset:
- Action List Compilation: Combining various sources such as existing action datasets and motion capture datasets, and soliciting suggestions from Mechanical Turk workers.
- Candidate Clip Acquisition: Utilizing YouTube video titles and then pinpointing potential action sequences within videos using trained image classifiers.
- Human Validation: Engaging Amazon Mechanical Turk (AMT) workers to confirm the presence of actions, ensuring high-quality annotations.
- De-duplication and Cleaning: Implementing feature-based techniques to remove duplicate video content and refining class labels to eliminate semantic overlaps or confusions (a rough de-duplication sketch follows this list).
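The paper describes de-duplication at the feature level; the sketch below illustrates the general idea by flagging clip pairs whose visual embeddings are nearly identical under cosine similarity. The choice of features and the threshold are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def find_near_duplicates(embeddings, ids, threshold=0.97):
    """Flag near-duplicate clips by cosine similarity of feature embeddings.

    embeddings: (N, D) array of per-clip visual features (e.g. pooled ConvNet
    activations); threshold: an assumed similarity cutoff, not the paper's value.
    """
    feats = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = feats @ feats.T
    duplicates = []
    n = len(ids)
    for i in range(n):          # quadratic scan; fine for a modest candidate pool
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                duplicates.append((ids[i], ids[j], float(sims[i, j])))
    return duplicates
```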
Balance and Bias Considerations
The issue of bias, particularly gender imbalance within action categories, is acknowledged. For example, actions such as "shaving beard" and "cheerleading" demonstrate gender-specific dominance. However, initial analyses reveal no significant classifier bias favoring any particular gender within these imbalanced classes. The analysis also ventures into other biases like age and race, suggesting minimal concern, with noted exceptions warranting further exploration.
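As a rough illustration of how such a check can be run, the sketch below compares classifier accuracy across annotated subgroups within each class. The record fields (group, correct) are hypothetical; Kinetics does not ship subgroup annotations, so any such analysis relies on separately collected labels.

```python
from collections import defaultdict

def per_group_accuracy(records):
    """Compare classifier accuracy across subgroups within each action class.

    records: list of dicts with keys "label", "group" (e.g. an annotated gender
    subgroup), and "correct" (bool). Field names are purely illustrative.
    """
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # label -> group -> [hits, count]
    for r in records:
        cell = totals[r["label"]][r["group"]]
        cell[0] += int(r["correct"])
        cell[1] += 1
    return {
        label: {group: hits / count for group, (hits, count) in groups.items()}
        for label, groups in totals.items()
    }
```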
Benchmark Architectures and Results
To evaluate baseline performance on the Kinetics dataset, the paper explores three primary ConvNet architectures:
- ConvNet + LSTM: Integrating an LSTM layer to encode temporal sequence information.
- Two-Stream Networks: Combining RGB and optical flow streams to capture appearance and motion information (sketched after this list).
- 3D ConvNets: Employing spatio-temporal filters to create hierarchical video representations.
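The sketch below shows the general shape of a two-stream model: two 2D ConvNets, one applied to a single RGB frame and one to a stack of optical-flow fields, fused by averaging class scores. The backbone (ResNet-18 here), input layout, and fusion rule are simplifying assumptions, not the exact architecture benchmarked in the paper.

```python
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    """Minimal two-stream sketch: an RGB (appearance) stream and an optical-flow
    (motion) stream, each a 2D ConvNet, fused by averaging class scores.
    The backbone and fusion are simplifications, not the paper's exact setup."""

    def __init__(self, num_classes=400, flow_frames=10):
        super().__init__()
        self.rgb_stream = models.resnet18(num_classes=num_classes)
        self.flow_stream = models.resnet18(num_classes=num_classes)
        # The flow stream takes a stack of horizontal/vertical flow fields
        # (2 * flow_frames channels) instead of 3 RGB channels.
        self.flow_stream.conv1 = nn.Conv2d(
            2 * flow_frames, 64, kernel_size=7, stride=2, padding=3, bias=False
        )

    def forward(self, rgb, flow):
        # rgb: (B, 3, H, W) single frame; flow: (B, 2*flow_frames, H, W)
        return (self.rgb_stream(rgb) + self.flow_stream(flow)) / 2
```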
Training these architectures on Kinetics yields substantially lower accuracy than training on older datasets like UCF-101, signifying Kinetics' greater difficulty and its potential to challenge more advanced techniques. For example, top-1 and top-5 accuracies stand at 56.0% and 81.3% respectively for the two-stream networks, highlighting the dataset's complexity.
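For reference, top-1 and top-5 accuracy are straightforward to compute from per-clip class scores; the helper below is a minimal sketch (function and argument names are ours, not the paper's evaluation code).

```python
import numpy as np

def topk_accuracy(scores, labels, k=5):
    """Top-k accuracy: fraction of clips whose true class is among the k
    highest-scoring predictions. scores: (N, num_classes); labels: (N,)."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best-scoring classes
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Usage:
# top1 = topk_accuracy(scores, labels, k=1)
# top5 = topk_accuracy(scores, labels, k=5)
```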
Practical Implications and Future Work
The Kinetics dataset is poised to fuel advancements in designing robust action recognition architectures, incorporating multi-modal information streams (RGB, flow, pose, etc.), and refining temporal aggregation mechanisms. Future work includes delving deeper into biases, possibly collaborating with social scientists and critical humanists, and leveraging trained models to generate features for newly defined action classes.
By delivering a significant leap in data scale and variety for action classification, the dataset sets a new benchmark and promises to serve the computer vision community extensively. The detailed documentation and methods offer a clear blueprint for building the large-scale, high-quality datasets needed to push human action recognition research forward.