Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?
Summary
The paper "Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?" examines whether very deep convolutional neural networks (CNNs) with spatiotemporal three-dimensional (3D) kernels can be trained effectively on current video datasets. Despite considerable advances in 3D CNN performance on action recognition, prior work had relied predominantly on relatively shallow 3D architectures. This paper evaluates both shallow and very deep 3D CNN architectures across multiple video datasets to assess their feasibility and effectiveness.
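As background, here is a minimal PyTorch sketch (not from the paper; kernel sizes are illustrative) of what a spatiotemporal 3D kernel means in practice: unlike a 2D kernel, which sees one frame at a time, a 3D kernel also spans the temporal axis, so a clip of shape (batch, channels, frames, height, width) is convolved jointly over space and time.

```python
import torch
import torch.nn as nn

# A 2D kernel convolves a single frame; a spatiotemporal 3D kernel
# also spans the temporal axis, so motion is convolved directly.
conv2d = nn.Conv2d(3, 64, kernel_size=(7, 7), stride=2, padding=3)
conv3d = nn.Conv3d(3, 64, kernel_size=(3, 7, 7),   # (time, height, width)
                   stride=(1, 2, 2), padding=(1, 3, 3))

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, RGB, frames, H, W)
frame = clip[:, :, 0]                   # a single frame: (batch, RGB, H, W)

print(conv2d(frame).shape)  # torch.Size([1, 64, 56, 56])    -- space only
print(conv3d(clip).shape)   # torch.Size([1, 64, 16, 56, 56]) -- space + time
```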
Key Findings
- Overfitting on Small Datasets: The paper begins by training a relatively shallow 3D ResNet-18 from scratch on UCF-101, HMDB-51, and ActivityNet, observing significant overfitting; these datasets are too small for training deep 3D CNNs from scratch. In contrast, the Kinetics dataset is large enough to avoid such overfitting.
- Kinetics Dataset for Deep 3D CNNs: The Kinetics dataset supports the training of much deeper 3D CNNs. As the depth of the ResNet models increases, accuracy on the Kinetics validation set improves up to 152 layers (ResNet-152); beyond that depth (e.g., ResNet-200), the gains plateau, mirroring observations for 2D ResNet architectures trained on ImageNet.
- Architectural Comparison: Among the architectures evaluated, ResNeXt-101 performed best on Kinetics, achieving 65.1% top-1 and 85.7% top-5 accuracy. Increasing the temporal length of the input clips improved ResNeXt-101's performance further, outperforming previous state-of-the-art baselines (a sketch of the block that distinguishes ResNeXt from a plain ResNet appears after this list).
- Performance on Transfer Learning: Fine-tuning Kinetics-pretrained networks on smaller datasets such as UCF-101 and HMDB-51 yielded superior performance (ResNeXt-101 attained 94.5% on UCF-101 and 70.2% on HMDB-51), demonstrating the effectiveness of large-scale pretraining (see the fine-tuning sketch below).
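For the architectural comparison, here is a hedged sketch (in PyTorch; not the paper's code) of the kind of block a 3D ResNeXt stacks: a bottleneck whose middle 3x3x3 convolution is grouped, with the group count (cardinality, 32 in the paper's models) being the hyperparameter that distinguishes ResNeXt from a plain ResNet. Channel widths and the usage shapes below are illustrative.

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck3D(nn.Module):
    """3D ResNeXt-style bottleneck: 1x1x1 reduce -> grouped 3x3x3 -> 1x1x1 expand.
    Cardinality (the number of groups) is the key ResNeXt hyperparameter."""
    def __init__(self, in_planes, planes, cardinality=32, stride=1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm3d(planes)
        self.conv3 = nn.Conv3d(planes, planes * 2, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm3d(planes * 2)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        if stride != 1 or in_planes != planes * 2:
            self.downsample = nn.Sequential(
                nn.Conv3d(in_planes, planes * 2, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(planes * 2))

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))  # grouped spatiotemporal conv
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)

block = ResNeXtBottleneck3D(in_planes=128, planes=128)
x = torch.randn(1, 128, 16, 28, 28)
print(block(x).shape)  # torch.Size([1, 256, 16, 28, 28])
```

Grouped convolution splits the channels into 32 parallel paths, which is what lets ResNeXt add capacity at roughly constant parameter count compared with a plain ResNet bottleneck.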
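And for the transfer-learning finding, a minimal sketch of the Kinetics-to-UCF-101 fine-tuning recipe: load a Kinetics-pretrained 3D CNN, replace the classification head, and train with a reduced learning rate. It uses torchvision's Kinetics-pretrained r3d_18 as a convenient stand-in for the paper's own 3D ResNets/ResNeXts; the layer choices and learning rates are illustrative assumptions, not the paper's exact schedule.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# Stand-in for a Kinetics-pretrained 3D CNN (the paper trains its own
# 3D ResNets/ResNeXts; r3d_18 is used here only because a pretrained
# checkpoint is readily available).
model = r3d_18(pretrained=True)  # 400-class Kinetics head
# (newer torchvision versions: r3d_18(weights="KINETICS400_V1"))

# Replace the classifier for the 101 classes of UCF-101.
model.fc = nn.Linear(model.fc.in_features, 101)

# Fine-tune everything, with a smaller learning rate for pretrained
# layers than for the freshly initialized head (illustrative values).
pretrained_params = [p for name, p in model.named_parameters()
                     if not name.startswith("fc")]
optimizer = torch.optim.SGD(
    [{"params": pretrained_params, "lr": 1e-3},
     {"params": model.fc.parameters(), "lr": 1e-2}],
    momentum=0.9, weight_decay=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random clip batch.
clip = torch.randn(4, 3, 16, 112, 112)   # (batch, RGB, frames, H, W)
labels = torch.randint(0, 101, (4,))
loss = criterion(model(clip), labels)
loss.backward()
optimizer.step()
```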
Implications
Practical Implications:
- Action Recognition: The results position Kinetics as an ImageNet-like benchmark for video, suggesting that deep 3D CNNs can substantially advance action recognition; simple 3D architectures pretrained on Kinetics outperform more complex 2D architectures.
- Scalability: The availability of a dataset at Kinetics' scale makes it worthwhile to invest computational resources in deeper, more sophisticated models, suggesting a strategic pivot toward large-scale datasets for video understanding tasks.
Theoretical Implications:
- Network Depth and Dataset Scale: The paper highlights a critical interdependence between model depth and dataset scale for 3D CNNs on video data, mirroring the trajectory of image-based deep learning.
- Spatiotemporal Feature Integration: Convolving jointly over space and time yields more nuanced and comprehensive feature representations, which is essential for dynamic content such as video.
Future Developments
The significance of this research extends beyond action recognition:
- General Video Analysis: Future investigations may explore applications in video summarization, anomaly detection, and temporal event localization using deep 3D CNNs and extensive pretrained models.
- Optical Flow and Motion Analysis: 3D CNNs could also inform refined methods for optical flow estimation, which is crucial for interpreting motion dynamics in video.
Additionally, continued advances in computational efficiency, such as optimized parallel processing and resource allocation strategies, may be necessary to train increasingly deep networks. As large-scale video datasets continue to grow, the potential of 3D CNNs to reshape computer vision applications will likely become more pronounced. Further research should also consider multimodal learning, merging visual, auditory, and textual signals for a richer semantic understanding of video.