Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
Carreira and Zisserman's paper, "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset," advances video action recognition by addressing key limitations of existing datasets and by proposing a novel architecture, the Two-Stream Inflated 3D ConvNet (I3D). This essay provides an overview of their methods, experimental results, and the broader implications of their research.
Limitations of Existing Datasets and the Introduction of Kinetics
Current action classification datasets such as UCF-101 and HMDB-51 are limited by their relatively small size, comprising on the order of 10k videos each. This paucity has hindered the identification of optimal video architectures, as most methods achieve similar performance on these benchmarks. To address this issue, Carreira and Zisserman introduce the Kinetics Human Action Video dataset. Kinetics substantially expands the volume and diversity of action recognition data, featuring 400 action classes and more than 400 clips per class, all collected from complex, realistic YouTube videos.
Evaluating State-of-the-Art Architectures
The paper reevaluates several prominent action classification models in light of the new Kinetics dataset. The investigated architectures include:
- ConvNet + LSTM: Integrates a recurrent neural network (LSTM) for modeling temporal sequences of frame-based features.
- 3D ConvNets: Employ three-dimensional convolutions to model spatio-temporal data.
- Two-Stream Networks: Utilize separate streams for RGB frames and optical flow, later fusing their outputs.
- Two-Stream I3D: Inflates 2D ConvNets into 3D by expanding their filters and pooling kernels with an additional temporal dimension, so that proven ImageNet architectures, and even their pre-trained weights, can be reused for video (see the inflation sketch after this list).
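To make the inflation idea concrete, here is a minimal PyTorch sketch; the function name and the example stem layer are illustrative, not taken from the paper's released code. Following the paper's bootstrapping recipe, each pre-trained 2D kernel is repeated along the new temporal axis and rescaled by 1/T, so that a "boring" video (one frame repeated T times) produces the same activations the 2D network gave the single image.

```python
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Inflate a pre-trained 2D convolution into 3D by repeating its
    kernel time_dim times along a new temporal axis and rescaling by
    1/time_dim, preserving activations on temporally constant input."""
    # Temporal stride is fixed at 1 here for simplicity; the actual
    # I3D model chooses temporal strides layer by layer.
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim,) + tuple(conv2d.kernel_size),
        stride=(1,) + tuple(conv2d.stride),
        padding=(time_dim // 2,) + tuple(conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # (out, in, kH, kW) -> (out, in, T, kH, kW), then rescale by 1/T.
    weight = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    conv3d.weight.data.copy_(weight / time_dim)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate a ResNet-style 7x7 stem into a 7x7x7 video stem.
stem_2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
stem_3d = inflate_conv2d(stem_2d, time_dim=7)
```

The 1/T rescaling is what makes ImageNet weights a sensible initialization: without it, the inflated filters would multiply their responses by T on static content.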
Experimental Results
The researchers conducted extensive experiments to benchmark these architectures on Kinetics as well as the smaller UCF-101 and HMDB-51 datasets. Key numerical results include:
- The I3D model reaches 74.2% accuracy on Kinetics when the predictions of its RGB and optical-flow streams are combined (a fusion sketch follows this list).
- Pre-training on Kinetics yields significant improvements for other datasets, with I3D achieving 80.9% on HMDB-51 and 98.0% on UCF-101.
- This pre-trained I3D model substantially outperforms previous state-of-the-art methods, providing a robust baseline for future action recognition studies.
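The combined two-stream numbers above come from late fusion: the RGB and flow networks are trained separately and their class predictions are averaged at test time. A minimal sketch, assuming two trained models, rgb_model and flow_model (hypothetical names), that each return class logits:

```python
import torch

@torch.no_grad()
def two_stream_predict(rgb_model, flow_model, rgb_clip, flow_clip):
    """Late fusion for a two-stream model: run each stream on its own
    input and average the per-class probabilities with equal weight.
    rgb_clip: (N, 3, T, H, W) RGB frames; flow_clip: (N, 2, T, H, W)
    stacked horizontal/vertical optical-flow fields."""
    rgb_probs = torch.softmax(rgb_model(rgb_clip), dim=-1)
    flow_probs = torch.softmax(flow_model(flow_clip), dim=-1)
    return (rgb_probs + flow_probs) / 2
```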
Implications and Future Directions
The findings underscore the importance of large-scale, diverse datasets like Kinetics for training deep video architectures effectively. The introduction of I3D also demonstrates that inflating 2D convolutional networks to 3D lets a model inherit mature image-classification designs and weights while capturing spatio-temporal dynamics. This hybrid approach opens new avenues for video models that combine the efficiency of proven 2D ConvNets with stronger temporal modeling.
The research indicates that transfer learning from large video datasets to other tasks, such as video segmentation or video object detection, could be highly beneficial; in practice this amounts to reusing a Kinetics-pretrained backbone and fine-tuning it on the target task (see the sketch below). Future models might further refine the spatio-temporal interplay by incorporating attention mechanisms or actor-specific action tubes.
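A minimal sketch in the spirit of the paper's transfer experiments on UCF-101 and HMDB-51: keep the Kinetics-pretrained backbone, swap the 400-way Kinetics head for a fresh classifier, and fine-tune (the paper compares training only the new layers against fine-tuning everything). Here `classifier` is an assumed attribute name for the final layer; real I3D implementations may expose it differently.

```python
import torch.nn as nn

def adapt_for_transfer(pretrained: nn.Module, num_target_classes: int,
                       freeze_backbone: bool = False) -> nn.Module:
    """Reuse a Kinetics-pretrained video network on a smaller dataset:
    replace its classification head and optionally freeze the backbone
    so only the new head is trained."""
    if freeze_backbone:
        for param in pretrained.parameters():
            param.requires_grad = False
    # `classifier` is an assumed attribute name for the final layer.
    in_features = pretrained.classifier.in_features
    pretrained.classifier = nn.Linear(in_features, num_target_classes)
    return pretrained
```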
In conclusion, Carreira and Zisserman's work marks a significant step forward in action recognition, contributing both a substantial dataset and a novel architecture that together set a new state of the art. The implications of their paper extend beyond action classification, promising advances across a range of video-based tasks in AI.