An Analysis of "Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition"
The paper "Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition," authored by Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh, addresses the challenge of video-based action recognition by leveraging the architectural advantages of residual networks (ResNets) extended to three dimensions. The research is situated in the rapidly expanding domain of computer vision, where human action recognition is of significant interest due to its applications in surveillance, video indexing, and human-computer interaction.
Spatio-Temporal Features and Network Architecture
Convolutional Neural Networks (CNNs) have revolutionized pattern recognition tasks, predominantly through 2D convolutions. However, recognizing actions in video requires not only spatial understanding of individual frames but also temporal understanding across frames. The paper explores 3D CNNs, whose 3D convolutional kernels capture these spatio-temporal features jointly. While 3D CNNs are prone to overfitting because of their large number of parameters, this research proposes extending very deep residual architectures (ResNets), which have demonstrated success in 2D image recognition, to the 3D setting.
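As a concrete illustration of a 3D convolution, the following minimal PyTorch sketch treats a video clip as a five-dimensional tensor with an explicit frame axis; the layer sizes and shapes here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A video clip is a 5D tensor: (batch, channels, frames, height, width).
# Illustrative shapes: a batch of two 16-frame RGB clips at 112x112.
clip = torch.randn(2, 3, 16, 112, 112)

# A 3D convolution slides a k_t x k_h x k_w kernel over time and space,
# so its output mixes information across neighboring frames as well as pixels.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

features = conv3d(clip)
print(features.shape)  # torch.Size([2, 64, 16, 112, 112])
```

Because the kernel extends over the frame axis, a single layer already encodes short-range motion, which a 2D convolution applied frame by frame cannot do.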
3D ResNets use shortcut connections that let the signal bypass stacked convolutional layers, easing the optimization of very deep networks. The paper presents detailed experiments with two configurations, an 18-layer and a 34-layer architecture, and shows that such deep networks can be trained effectively on large-scale video datasets such as Kinetics.
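The sketch below shows one way such a residual block could look in 3D, assuming a PyTorch-style implementation; it mirrors the two-convolution basic block of ResNet-18/34 but omits details such as strided downsampling and projection shortcuts.

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """A 3D residual block in the ResNet-18/34 style: two 3x3x3
    convolutions whose output is added to an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                      # shortcut: the signal bypasses the convolutions
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity              # residual addition eases optimization
        return self.relu(out)

block = BasicBlock3D(64)
x = torch.randn(1, 64, 16, 28, 28)
print(block(x).shape)  # torch.Size([1, 64, 16, 28, 28])
```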
Dataset Utilization and Training Methodology
The research uses two large video datasets: ActivityNet (200 action classes, roughly 20,000 untrimmed videos) and Kinetics (400 classes with more than 400 clips per class, over 300,000 videos in total). Kinetics, with its high-quality annotations and much larger scale than datasets like UCF101 and HMDB51, mitigates overfitting and allows more robust training of 3D CNNs.
During training, stochastic gradient descent with momentum is employed. Training samples, 16-frame clips, are generated by randomly selecting a temporal position in each video and applying multi-scale spatial cropping around corner and center positions, together with random horizontal flipping, before resizing to 112 x 112 pixels. For evaluation, a sliding window generates clips across each video's frames, and recognition is performed by averaging class probabilities over all clips.
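A minimal sketch of these two steps, assuming PyTorch; the stand-in model, learning rate, weight decay, and the non-overlapping sliding window are placeholders, not the paper's exact settings.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 400  # Kinetics has 400 action classes

# Stand-in for a full 3D ResNet: one 3D conv, global pooling, and a classifier.
model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(64, NUM_CLASSES),
)

# Training uses SGD with momentum, as in the paper; the hyperparameter
# values below are placeholders for illustration.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-3)

def predict_video(model, video, clip_len=16):
    """Slide a window along the frame axis, score each 16-frame clip,
    and average class probabilities over all clips of the video."""
    model.eval()
    probs = []
    with torch.no_grad():
        for start in range(0, video.shape[1] - clip_len + 1, clip_len):
            clip = video[:, start:start + clip_len].unsqueeze(0)  # (1, C, T, H, W)
            probs.append(torch.softmax(model(clip), dim=1))
    return torch.stack(probs).mean(dim=0)  # video-level class probabilities

video = torch.randn(3, 64, 112, 112)      # (C, frames, H, W): a 64-frame toy video
print(predict_video(model, video).shape)  # torch.Size([1, 400])
```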
Empirical Evaluation
Empirical results demonstrate the advantage of the proposed 3D ResNets over shallower architectures like C3D, particularly in the large-scale setting provided by Kinetics. The 3D ResNet-34 outperformed C3D and achieved results competitive with state-of-the-art architectures such as the I3D model trained without ImageNet pretraining.
Despite these successes, the paper reports overfitting on smaller datasets such as ActivityNet, revealing that an architecture of this depth requires a substantial amount of data to train effectively. The implication is that, given sufficiently large datasets, deep models equipped with batch normalization can be trained from scratch and achieve further gains in accuracy.
Implications and Future Directions
This research solidifies the applicability of residual architectures in 3D convolution-based tasks. It opens pathways for addressing overfitting in high-parameter CNNs by expanding dataset scale while reaping the benefits of deep architectures. The findings suggest that leveraging residual networks with 3D kernels could lead to more accurate and reliable video recognition systems.
Going forward, the paper suggests exploring deeper models such as ResNet-50 and ResNet-101, as well as experimenting with related architectures such as DenseNets. With more computational resources, larger batch sizes could also be tested, since larger batches are known to improve training with batch normalization.
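For orientation, ResNet-50 and ResNet-101 replace the two-convolution basic block with a bottleneck design; the sketch below shows how such a block might look in 3D, with layer choices that are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class Bottleneck3D(nn.Module):
    """Bottleneck block used by deeper ResNet variants: a 1x1x1
    convolution reduces channels, a 3x3x3 convolution processes the
    spatio-temporal features, and a final 1x1x1 convolution expands
    channels again before the shortcut addition."""

    expansion = 4  # output channels = mid_channels * expansion

    def __init__(self, in_channels: int, mid_channels: int):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.reduce = nn.Conv3d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.conv = nn.Conv3d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv3d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(mid_channels)
        self.bn2 = nn.BatchNorm3d(mid_channels)
        self.bn3 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when input and output channel counts differ.
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + self.shortcut(x))

block = Bottleneck3D(64, 64)
x = torch.randn(1, 64, 16, 28, 28)
print(block(x).shape)  # torch.Size([1, 256, 16, 28, 28])
```

The bottleneck keeps the expensive 3x3x3 convolution at a reduced channel width, which is what makes the 50- and 101-layer depths computationally feasible.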
Overall, the paper makes a compelling case for the re-evaluation of neural network architectures in video action recognition and encourages further exploration into how architecture depth can be balanced with dataset breadth to achieve optimal learning outcomes.